PREFACE TO THE FIRST EDITION


This book has been written primarily for those who have little or no
previous training in mathematical statistics, but who have some training
or experience in the presentation and handling of statistical data. It is 
consequently not written in the form of a mathematical treatise, and 
mathematical proofs have not been included. On the other hand an attempt 
has been made to cover all the modern developments of sampling theory 
which are of importance in census and survey work, and to give an adequate 
discussion of the complexities that are encountered in their practical application. 
This has necessitated fuller treatment of the subject than is to be found in 
textbooks on mathematical statistics, or than is normally included in statistical 
courses. Indeed, the orderly development imposed by the preparation of a 
book revealed a number of gaps in current theory which had to be filled in. 
Consequently the book should also prove of value to mathematical statisticians 
who are interested in sampling theory and its applications. 

The work had its origin in a request of the United Nations Sub-Commission 
on Statistical Sampling, at their first session held at Lake Success in September, 
1947, that a manual be prepared to assist in the execution of the projected 
1950 World Census of Agriculture, and the 1950 World Census of Population. 
The Sub-Commission were particularly impressed with the need for a wider
use of sampling in the less developed areas, and it was originally intended 
that only the sampling problems encountered in censuses and surveys in these 
areas should be dealt with. On reviewing the matter, however, I came to the 
conclusion that conditions differed so greatly in different areas that it would 
be necessary to cover a wide variety of methods, which in essentials differed 
little from the methods appropriate to censuses and surveys in more fully 
developed areas. It therefore seemed best to take the opportunity of writing 
a more general book. I believe that on balance the course taken will be 
advantageous to those concerned with censuses and surveys in the less developed 
areas, since the modern developments in sampling have been chiefly made in 
conjunction with its application in the more fully developed areas, and they
can best be explained against the background of the material and problems 
to which they have been applied. 

The various computational procedures have been illustrated, as far as
practicable, by numerical examples. These examples in the main have an
agricultural background, since this type of data was most readily accessible
and is also particularly relevant to the original purpose of the book. For the
most part the data on which they are based form a small part of the results
of much larger surveys. The examples do not in themselves serve as models
for the reduction of large bodies of data, but once the general principles have 
been grasped no great difficulty should be found in planning this reduction, 
which presents very similar problems to those encountered in the analysis 
of material from complete censuses and surveys. 

I have not attempted to ascribe priority in the discovery of particular 
methods. Indeed, such a task presents almost insuperable difficulties, since 
the methods used in many surveys are not at all fully reported, and the main 
developments have arisen chiefly through ingenious practical workers devising 
new methods of selection which seemed on commonsense grounds to be capable
of giving specially accurate results, or appeared to possess other valuable
properties.
* * * 


I have been much helped by my wife in the planning of the book, and
I have had considerable assistance in its preparation from various members 
of the Rothamsted Statistical Department. My thanks are particularly due 
to Dr. Rose O. Cashen and Mr. H. D. Patterson for computing and checking 
many of the examples, to Dr. P. M. Grundy and Mr. G. M. Jolly for their 
critical reading of the galley proofs, and to Miss Ruth Hunt and other members 
of the secretarial and computing staff for all their work and assistance. I also 
wish to thank the publishers and printers for the care taken in the preparation 
of this book in spite of the great haste that was necessary. 


F. YATES 


ROTHAMSTED EXPERIMENTAL STATION, 
7th February, 1949.
PREFACE TO THE SECOND EDITION


Two new chapters (9 and 10) have been added in the second edition. These
amplify certain aspects not fully dealt with in the first edition and contain 
accounts of various recent developments. Some space is also devoted to problems
arising in the analysis of investigational surveys. Chapter 9 is of a
fairly advanced nature, but most of Chapter 10 will, I think, be found fairly 
easy reading. 

Apart from the correction of a few errors, the text of Chapters 1-8 remains 
unchanged, but references to the new matter have been inserted where necessary. 
In adopting this course I have been influenced by three considerations. First, 
that the material added would gain little by being incorporated in the body 
of the book. Second, the use of the standing type substantially reduces the 
cost and labour of revision and at the same time ensures that no new errors 
are introduced into the original text. Third, those familiar with the first 
edition will probably find it more convenient to have the new matter all assembled 
in one place. 

Translations of the first edition into French and Japanese have already 
appeared. The reception awarded to the first edition, and the influence it
appears to have had amongst practical workers, have more than gratified my
hope that it would serve a useful purpose in encouraging a wider adoption of
sound techniques. The United Nations Sub-Commission on Statistical 
Sampling, to whose deliberations the book owes so much, was wound up after
its meeting in Calcutta at the beginning of 1952. In all it held five sessions. 
The rapid adoption of sampling techniques in many parts of the world where 
they had hitherto been unknown, and the very marked improvement in survey 
practice, will serve as a lasting monument to its labours. 

For those who require a short introduction to the subject, particularly in 
its statistical aspects, I have appended a list of sections for first reading (p. xvi). 
If these are thoroughly mastered the reader should have a reasonable grasp 
of the main types of sampling and the basic statistical theory. He can then 
amplify and extend this knowledge as need arises. 

My thanks for assistance in the preparation of this edition are due to Mr. 
H. D. Patterson, Mr. F. B. Leech and Mr. P. R. D. Avis, to my secretary 
Miss Ruth Hunt, without whose diligent attention to detail many errors and
omissions would have occurred, and to Mr. D. H. Rees for the preparation
of the supplementary bibliography.


F. YATES


ROTHAMSTED EXPERIMENTAL STATION, 
30th June, 1953. 
SAMPLING METHODS FOR
CENSUSES AND SURVEYS 


CHAPTER 1 
THE PLACE OF SAMPLING IN CENSUS WORK 


1.1 The sampling process 

Sampling, that is, the selection of part of an aggregate of material to 
represent the whole aggregate, is a long-established practice. Simple examples 
are provided by a handful of grain taken from a sack, or a piece of cloth cut 
off a roll. In these cases little attention need be paid to the selection process 
since the whole of the material is similar or well-mixed, and any part of it 
if not too small is likely to be closely representative of the whole. When, 
however, the aggregate to be sampled consists of units which are somewhat
dissimilar amongst themselves, and which are not well-mixed, a small sample
of these units may not be representative of the whole aggregate. Even if units
are selected from different parts of the aggregate, and other suitable precautions
are taken, the sample is likely to a certain extent to be unrepresentative owing
to the chance inclusion of an undue proportion of units of a particular type.
It will clearly not be representative if units of a particular type are chosen
deliberately to the exclusion of other types, or if the process of selection is 
such that certain types of unit are favoured at the expense of others. Thus 
in sampling a heap of coal by taking a few shovelfuls from the edges, too great 
a proportion of the large lumps will be obtained, since the large lumps tend 
to roll down the sides and be distributed round the edges of the heap. 
Similarly in the sampling of continuous material, a single portion, even if 
quite large, may not be adequately representative ; a piece of cloth cut off 
the end of a roll in which the quality of the weaving varies progressively, will 
not form an adequate sample of the whole roll.

Census and survey work is normally carried out on material made up of
dissimilar units. Censuses of population, censuses of industrial production, 
and censuses of agriculture have the common feature that the aggregate of 
material embraces a large number of separate units which are often markedly 
dissimilar in various respects. In many cases the purposes for which the 
information is required are adequately served if a proportion only of the units 
are covered, but because of the dissimilarity of the different units neither 
haphazard nor casual selection, and still less deliberate selection, can be
expected to provide a representative sample. Rigorous processes of selection 
have therefore to be used. 

Censuses carried out on a properly selected sample will be called sample 
censuses. There has in the past been a tendency to use the term sample to 
refer to the results of an attempted complete census in which there has been 
failure to obtain information from a substantial proportion of the units. Its 
use in this sense is strongly to be deprecated ; instead the term incomplete 
census is suggested. The term sample should be reserved for a set of units 
or portion of an aggregate of material which has been selected in the belief 
that it will be representative of the whole aggregate. 


1.2 Sampling errors 


Whether or not a sample will give results which are sufficiently 
representative of the whole aggregate depends primarily on whether the errors 
introduced by the sampling process are sufficiently small not to invalidate 
the results for the purposes for which they are required. Even if a proper 
process of selection is employed, the sample cannot be exactly representative 
of the whole aggregate. The inevitable errors which then occur in the 
results are termed the random sampling errors of these results. The average 
magnitude of these random sampling errors will depend on the size of the 
sample, on the variability of the material, on the sampling procedure adopted, 
and on the way in which the results are calculated. 

It is a fortunate fact that if a proper process of selection is adopted, the 
average magnitude of the random sampling errors, and indeed the expected 
frequency of occurrence of errors of any magnitude, can be calculated from 
the detailed results obtained from an actual sample. The methods by which 
this can be done depend on the mathematical theory of statistical sampling. 
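
As a minimal illustration of this point, the sketch below estimates a population mean and its standard error from the sample values alone. It assumes the simplest case, a simple random sample drawn without replacement, and uses the usual textbook formula for that case. The function name, the population of farm areas and all the figures are invented for the illustration and are not taken from any survey discussed in this book.

```python
import math
import random
import statistics

def srs_mean_and_standard_error(sample, population_size):
    """Estimate the population mean and the standard error of that estimate
    from a simple random sample drawn without replacement."""
    n = len(sample)
    mean = statistics.fmean(sample)
    variance = statistics.variance(sample)   # sample variance, divisor n - 1
    fpc = 1 - n / population_size            # finite population correction
    standard_error = math.sqrt(fpc * variance / n)
    return mean, standard_error

# Hypothetical population: areas (in acres) of 2,000 farms, invented figures.
rng = random.Random(1949)
population = [rng.uniform(10, 200) for _ in range(2000)]

small_sample = rng.sample(population, 50)
large_sample = rng.sample(population, 800)

mean_small, se_small = srs_mean_and_standard_error(small_sample, len(population))
mean_large, se_large = srs_mean_and_standard_error(large_sample, len(population))
_, se_census = srs_mean_and_standard_error(population, len(population))

print(f"n =  50: mean {mean_small:7.2f}, standard error {se_small:5.2f}")
print(f"n = 800: mean {mean_large:7.2f}, standard error {se_large:5.2f}")
print(f"complete census: standard error {se_census:5.2f}")
```

The standard error is computed without any reference to the units not in the sample, and it falls as the sample grows, vanishing when every unit is included. More elaborate sampling procedures require different formulae.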

An extension of the analysis involved in the calculation of these errors 
enables the relative accuracy of the different sampling methods which can be 
employed on the same material to be assessed, and thus enables further surveys 
to be more efficiently planned. 

It is the development of these processes that has changed sampling from 
a speculative and uncertain procedure to a method having definite and 
determinable precision. Sampling has thus become a reliable method in which 
full confidence can be placed. In addition, the possibility of setting ascertainable 
limits to the random sampling errors has served to throw into prominence 
those other types of error which arise from faulty selection processes or faulty 
methods of observation, or which exist in some other source of information
with which the sampling results are being compared. 


1.3 The place of sampling in census and survey work 


Sampling will only be of use in census work if, as mentioned in Section 1.2,
the sampling errors are sufficiently small not to affect the validity of the results
for the purposes for which they are required. This will in part be a function
of the degree to which the results have to be broken down. If only overall 
results for the whole population are required, a given degree of accuracy will 
be attained with a far smaller sample than will be the case if detailed results 
for different parts of the population (e.g. different regions, towns, etc.) are 
required. In certain circumstances the sample may have to be so large that 
there will be little point in using a sample census in place of a complete 
census. Obviously, in the extreme case where information on all the individual 
units is required, this can only be obtained by a complete census. 
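
The effect of breaking the results down can be seen in a rough calculation. The sketch below assumes a simple random sample drawn without replacement and the usual textbook relation between sample size and the standard error of a mean. The function name and all the figures are hypothetical, and the assumption that the variability within each region equals that of the whole population is a deliberate simplification.

```python
import math

def required_sample_size(population_size, standard_deviation, target_standard_error):
    """Smallest simple random sample (without replacement) whose mean has a
    standard error no larger than target_standard_error."""
    # Size that would be needed if the population were effectively infinite.
    n0 = (standard_deviation / target_standard_error) ** 2
    # Allowance for the finite size of the population.
    return math.ceil(n0 / (1 + n0 / population_size))

# Hypothetical figures: 100,000 units in 10 regions of 10,000 units each.
# Assumption: the standard deviation between units is 50 in every region
# and in the population as a whole.
overall = required_sample_size(100_000, 50, 1)
per_region = required_sample_size(10_000, 50, 1)
all_regions = 10 * per_region

print(f"overall result only:        {overall:6d} units")
print(f"each region separately:     {per_region:6d} units")
print(f"all ten regions separately: {all_regions:6d} units")
```

With these invented figures the overall result needs about 2,440 units, whereas ten regional results of the same accuracy need 2,000 units each, or 20,000 in all, which is a fifth of the whole population.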

Another factor which influences the decision whether or not to use sampling 
is the relative difficulty and cost of organizing a sample census and a complete 
census. The amount of effort and expense required to collect information
is always greater per unit for a sample than for a complete census. In addition 
a sample census presents its own organization problems, some of which are 
absent from a complete census, and it occasionally happens, if the information 
required is very simple, that a complete census can be carried out through the 
ordinary administrative channels, whereas a sample census requires the 
setting-up of a separate organization. Usually, however, if the size of the 
sample needed to give the required accuracy represents only a small fraction 
of the whole population, the total effort and expense required to collect the 
information by sampling methods will be very much less than that required 
for a census of the whole population. 

In many cases, therefore, sampling results in great economy of effort. 
It has also other advantages which are not so immediately apparent. In the 
first place, the completeness and accuracy of the returns may be much more 
easily ensured if the information is collected from only a small proportion 
of the population. If, for example, questionnaires are sent through the post, 
it is frequently impossible in a complete census to bring pressure to bear on 
those who fail to make their returns, even where the completion of these 
questionnaires is compulsory, owing to the large numbers of individuals 
involved. In the case of a sample, the smaller number of individuals enables 
follow-up notices to be sent and telephone calls and visits to be made. The 
separate returns can also be much more carefully scrutinized, and further 
enquiries undertaken where there is reason to doubt their accuracy. 

Secondly, it is possible to obtain more detailed information in a sample 
census. Although the burden on the individual of furnishing more detailed 
information is not lessened, except when different items of information can 
be obtained from different individuals, the individuals concerned are more 
likely to be willing to provide such information if they know that they represent 
a small sample of the whole population. Detailed information, when obtained, 
can be more easily handled, both at the stage of abstraction and coding of the 
original information and in the analysis of the coded results. Owing to the 
reduced volume of material that has to be handled the quality of the abstraction 
and analysis can also be improved, the former because a higher grade of clerical 
labour can be employed, with better supervision, and the latter because the 
data can be classified in many more ways with the same amount of computing
or machine time.

Thirdly, in many types of census the use of sampling makes possible 
a very considerable increase in speed, both in the execution of the field work, 
and in the analysis of the results. Speed in analysis can also be obtained, in 
the case of a complete census, by taking a sample of the returns for abstraction 
and analysis. This device is frequently of value for providing preliminary 
results quickly, even when a final analysis of the whole of the returns is 
ultimately required. 

The use of sampling is essential for investigations of the sociological type 
in which extensive and detailed information has to be collected from individuals, 
many of whom have neither the education nor experience required to answer 
detailed questionnaires without assistance. It is equally essential in 
investigations requiring skilled physical observations and measurements. 
Such investigations can only be carried out by the use of trained investigators, 
and complete investigations covering any large group of the population or body 
of material are consequently impossible, both on grounds of expense and 
because, even if the expense can be tolerated, a sufficient body of investigators 
can rarely be recruited and trained. 

For an investigation of this kind involving the collection of elaborate 
information the term survey is usually employed. It seems a mistake, however, 
to confine the word survey to a sample survey or the word census to a complete 
census. Thus B. Seebohm Rowntree (1901), when he carried out an 
investigation into the social and economic conditions of all working-class 
families in York, correctly described this as a survey. 

Although the use of sampling necessarily introduces certain inaccuracies, 
owing to sampling errors, the results obtained by sampling are frequently 
more accurate than those obtained in a complete census or survey. The 
random sampling errors are always assessable. The other errors to which a 
survey is subject, such as incompleteness of returns and inaccuracy of 
information, are liable to be very much more serious in a complete census 
than in a sample census, since far more effective precautions can be taken 
to see that the information is accurate and complete in a sample census. 
Furthermore, the use of sampling greatly facilitates the imposition of additional 
more detailed checks. Indeed, a complete census can only be properly tested 
for accuracy by some form of sampling check. 

On the other hand, the claim that is sometimes made that the reliability 
and accuracy of the results of a properly planned sample census can be 
assessed with full objectivity from the results themselves is only partly true. 
The random sampling errors can be so assessed, and under certain circumstances 
it is possible to obtain comparisons between different investigators. If all 
investigators or respondents tend to make the same kind of error, however, 
this will not be revealed in the results, whether the census is complete or carried 
out on a sample.

In respect of coverage a sample census may in certain circumstances be 
less reliable than a complete census. It is, for example, relatively simple for 
an investigator to ascertain by direct question whether an individual has already 
been included in a population census, and simple intensive checks of certain 
areas, say villages, can be made in a similar manner to verify that there is no 
appreciable number of omissions. Similarly, in a survey of physical objects 
such as houses, a marking system or other suitable device can often be used 
to guard against duplication and omission. Such checks are impossible in 
the case of a sample census. 

This is one of the most difficult points in the practical design of many 
sample surveys, particularly in undeveloped areas. To overcome it complete 
enumeration can sometimes be used in conjunction with sampling. Where a 
complete enumeration of the whole of the population or aggregate of material 
presents no particular difficulty, but where the collection of detailed information 
from all units would be a difficult or impossible undertaking, a complete 
enumeration can be carried out. This is then used as a basis for the selection 
of the sample for such sample censuses and surveys as are required to provide
the detailed information.


1.4 Development of the use of sampling in censuses and surveys 


Prior to the development of the appropriate methods of estimation of 
sampling errors and a clear recognition of the conditions governing satisfactory 
methods of selection of the sample, the use of sampling in census and survey 
work often proved unsatisfactory. There are many early examples of sample 
censuses and surveys which are defective in one way or another. Even when 
the basic principles of the simpler forms of sampling were understood, the 
attempted use of more complicated forms before methods of evaluating their
errors and relative efficiency had been worked out gave rise to further defective
surveys.

This has led to a certain mistrust of sampling, which still exists in some
quarters. During recent years, however, there has been a rapid growth in
the use of sampling in various countries. This development has been greatly 
stimulated by the war and its attendant measures of large-scale economic 
control. Such measures, if they were to be effective in the changing conditions 
met with in wartime, demanded an efficient and speedy information service 
which only the sampling method could supply. This has resulted in further 
improvements in technique through the stimulation of research into the theory 
of sampling methods, and the provision of basic data for practical investigations 
of the relative efficiency of the various methods in different fields. 

It still remains true, however, that in inexperienced hands sampling may
give unsatisfactory results, owing to the use of faulty methods of selection,
inappropriate sampling design, or inefficient methods of estimation. The prime
requirement of any large-scale sample survey is therefore that the organization
of the survey should be carried out by a person who has adequate knowledge
and experience of sampling methods and their application. The methods
employed must be thoroughly sound, theoretically and practically, both in
order that satisfactory results may be ensured, and also in order that mistrust 
cannot subsequently be engendered by criticism of the methods adopted. 
It must never be forgotten that it is not sufficient to provide results which are 
in fact correct. They must also be generally accepted if they are to have their 
full value. 

It is sometimes stated that no large-scale sample census or survey should 
be carried out without the advice of an expert mathematical statistician with 
experience of such work. Unquestionably, if the services of such an expert 
can be secured this is all to the good, but my own experience is that no one 
expert can be expected to supervise adequately more than a very few surveys 
at any one time, since adequate supervision demands a very full knowledge 
both of the material that is to be surveyed and of local conditions, coupled 
with close attention to detail at all stages. An expert acting in an advisory 
capacity is therefore no substitute for the statistician on the spot, who must be 
prepared to accept responsibility for the planning, execution and analysis of 
the survey. To do this he must himself have both an adequate knowledge 
of sampling procedure and thorough knowledge of the material and local 
conditions. 

Consequently, if full and effective use is to be made of sampling methods, 
statisticians and others who already have experience of the conduct of complete 
censuses but no training in sampling methods must themselves undertake a 
study of these methods, in order that they may decide in what ways these can 
be applied to their own problems. The function of the expert then becomes 
one of advice on exceptional problems, rather than one of detailed supervision. 

Fortunately the principles underlying good sampling methods are not 
unduly difficult to understand, and provided a proper respect is observed for 
the fundamental rules of procedure I believe they can be successfully applied
by those who have statistical experience but who are not primarily mathematical
statisticians. 


1.5 Method of presentation 


The method of presentation adopted in this book is to take the various 
parts of the sampling process in roughly the order they are encountered in 
the execution of a census or survey, and discuss the various aspects of each 
part in turn. Thus Chapters 2 and 3 describe the various types of sample 
that can be used, and the general principles to be followed in the selection 
of a sample, Chapter 4 deals with the practical planning of a survey, and 
Chapter 5 with the problems encountered in its execution and in the analysis
of the results. The remaining chapters are concerned with the more 
statistical problems. Chapter 6 deals with the various methods of estimating
the population values, Chapter 7 with the estimation of sampling errors, and 
Chapter 8 with the determination of the relative efficiency of the various 
sampling methods. This method of presentation has the advantage that the 
more practical aspects of sampling procedure are dealt with first. It is true
that knowledge of the statistical techniques described in the later chapters is 
necessary before the relative merits of different methods of sampling any 
particular type of material can be accurately assessed. The detailed application 
of these techniques, however, is the province of those responsible for the 
numerical analysis of the results, whereas the planning is also the concern 
of those who require the information and those who are concerned with its 
collection. The planning can be undertaken much more efficiently, and with 
added interest, if all concerned understand in general terms the underlying 
problems. It is hoped that study of the first five chapters will give this
understanding. If they also act as a stimulus to the study of Chapter 6 and the
first few sections of Chapter 7 the understanding should be correspondingly 
deepened. 

For those responsible for the numerical analysis of the results, and for 
the assessment of the relative efficiency of the different possible methods, 
thorough study of the whole book is necessary. This study should include 
the reworking of the numerical examples. Only by this procedure can a 
thorough grasp of the details of the various methods be obtained. 

The separation of the discussion of the methods of estimation of the 
population values, of the sampling errors, and of efficiency necessarily involves 
a good deal of cross-reference, particularly in the numerical examples. Since 
this appeared inevitable, it was with some hesitation that the chosen method 
of presentation was adopted. On balance, however, this disadvantage appeared 
to be outweighed by the advantage of being able to present as a whole the 
relatively simple techniques involved in estimation before the more complicated 
techniques required for the estimation of error and the assessment of relative 
efficiency. It is believed that this will make the book more useful to those 
who do not require to go deeply into these latter techniques. For those who 
prefer it, there is nothing to prevent the simultaneous study of the corresponding 
sections of Chapters 6 and 7, or indeed of Chapters 6, 7 and 8. Chapters 6, 
7 and 8 may also, if desired, be taken before Chapters 4 and 5. 


1.6 Terminology and notation 


The question of terminology was considered by the United Nations
Sub-Commission on Statistical Sampling, at its second session held in Geneva in
September, 1948. Their recommendations are included in a memorandum 
entitled Recommendations concerning the Preparation of Reports on Sampling 
Surveys. With a few minor exceptions the terminology adopted in this book 
is that recommended by the Sub-Commission. 

New conventions have been adopted for the mathematical notation. The 
use of bold face and Gill Sans type for population values and their estimates, 
and of capital letters for the population totals, has enabled the formulae to be
presented in a very simple, and it is hoped easily understandable form. By 
the use of this notation the elaborate summation notation which has become 
current in much of the literature on sampling has been avoided. It is
recognized that the notation is not particularly convenient for manuscript
and typescript, but the difficulty can in fact be overcome by the use of single
and double underlining, with the corresponding verbal descriptions of
“sub-bar” and “sub-double-bar.”


CHAPTER 2 


REQUIREMENTS OF A GOOD SAMPLE 


2.1 Bias 

The principal object of any sampling procedure is to secure a sample 
which, subject to limitations of size, will reproduce the characteristics of the 
population, especially those of immediate interest, as closely as possible. 

At first sight it might appear that the most accurate results could be 
obtained by deliberate selection of the units to be included in the sample. 
In particular, if averages only are of interest, units might be selected which 
appear to be nearest to the average. If, for example, a quick assessment of 
the yield per acre of an agricultural crop is required, district officers might 
be asked to select some “average” fields in each district, and to determine
the yields of these fields. 

Such a sample is unfortunately very often of little value. Its primary 
fault is that it may well be biased, that is, the selection of all the fields may 
be affected by similar errors. Thus, in order to enhance the reputation of 
their districts, all district officers may tend to select fields which yield more 
heavily than the average, or, if they feel that the interests of the farmers or the 
country may be furthered by an underestimate, they may select fields which 
yield less than the average.

Even if the district officers can be trusted to be completely objective, 
considerable unconscious errors of judgment, all tending in the same direction, 
may still occur, and such errors may far outweigh any increase in accuracy 
resulting from deliberate selection. Nor will increase in the number of officers 
concerned in the selection necessarily improve matters, since all may be subject 
to the same type of error.

We may consequently distinguish between two types of sampling error, 
those arising from biases in selection, etc., and those due to chance differences 
between the members of the population included in the sample and those 
not included. The aggregate of the former in the sample will be termed the 
error due to bias and the aggregate of the latter the random sampling error, or 
when bias is known to be absent, the sampling error. The total sampling error
will, of course, be made up of the bias, if any exists, and the random sampling 
error. The essence of bias is that it forms a constant component of error 
which does not decrease, in a large population, as the number in the sample 
increases, whereas the random sampling error decreases on the average as
the number in the sample increases.
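
The distinction can be illustrated by a small simulation, offered only as a sketch. The population of field yields is invented, and the biased selector is represented, purely as an assumption, by making the chance of choosing a field proportional to its yield. For each size of sample the average error over repeated samples measures the bias, and the root-mean-square error measures the total error.

```python
import math
import random
import statistics

rng = random.Random(2)

# Hypothetical population: yields per acre of 5,000 fields (invented figures).
yields = [rng.uniform(10, 50) for _ in range(5000)]
true_mean = statistics.fmean(yields)

def random_sample_mean(n):
    """Mean of n fields drawn strictly at random, without replacement."""
    return statistics.fmean(rng.sample(yields, n))

def biased_sample_mean(n):
    """Mean of n fields picked by a selector who favours heavy yields.
    Assumption: the chance of picking a field is proportional to its yield."""
    return statistics.fmean(rng.choices(yields, weights=yields, k=n))

def summarise(draw, n, replications=200):
    """Return (average error, root-mean-square error) over repeated samples."""
    errors = [draw(n) - true_mean for _ in range(replications)]
    average_error = statistics.fmean(errors)
    rms_error = math.sqrt(statistics.fmean([e * e for e in errors]))
    return average_error, rms_error

results = {}
for n in (25, 100, 400):
    results[n] = {
        "random": summarise(random_sample_mean, n),
        "biased": summarise(biased_sample_mean, n),
    }
    for method, (average_error, rms_error) in results[n].items():
        print(f"n={n:3d}  {method:6s}  average error={average_error:6.2f}"
              f"  rms error={rms_error:5.2f}")
```

Under these assumptions the average error of the random samples stays close to zero and their root-mean-square error shrinks as the sample grows, while the average error of the biased samples remains at much the same positive value whatever the size of the sample.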


2.2 Methods of selection which give rise to bias 
There are a number of ways in which faulty selection of the sample may
give rise to bias. The main causes may be broadly classified as follows :— 
(1) Deliberate selection of a “‘ representative” sample. This is the type 
of bias described above. 


(2) A procedure of selection depending on some characteristic which is 
correlated with properties of the unit which are of interest. Many 
haphazard selection processes give rise to biases of this kind. 


(3) Conscious or unconscious bias in the selection of a “ random ” sample. 
If a proper random process is not strictly adhered to, the investigator, 
although claiming that his sample is random, may allow his desire 
to obtain a certain result to influence his selection. This type of bias
is particularly serious, since its existence may not be immediately
apparent.


(4) Substitution. Investigators often substitute another convenient member 
of the population when difficulties are encountered in obtaining 
information. Thus, in a house-to-house survey the next house 
may be taken when there is no reply. This will necessarily lead to 
a preponderance of houses of the type that are occupied all day, 
e.g. houses of people with families. 


(5) Failure to cover the whole of the chosen sample. If no second visit 
is made to houses from which no reply is received there will still 
be bias even though no substitution is attempted. This fault is 
particularly prevalent in postal questionnaires, which are often very 
incompletely returned. Returns are clearly likely to be received 
from individuals who are specially interested in the objects of the 
survey, or possess other characteristics which make them 
unrepresentative of the whole population. 


2.3 Avoidance of bias in selection 


It is clear that, if possibilities of bias exist, no fully objective conclusions 
can be drawn from a sample. The first essential of any sampling procedure 
must therefore be the elimination of all important sources of bias.

The simplest, and the only universally certain way, of avoiding bias in
the selection process is for the sample to be drawn either entirely at random
or at random subject to restrictions which, while improving the accuracy,
are of such a nature that they do not introduce bias into the results. In some
cases, however, certain forms of systematic selection, such as the selection of 
names at equal intervals down a list, or the use of an evenly spaced grid of 
points on a map, may be permissible. 

Random selection does not mean haphazard selection. A random sample 
can only be obtained by adherence to some proper random process, such as 
the drawing of lots or the use of a table of random numbers. Sticking pins 
into a map will not give a random distribution of points in a map. The 
selection of houses by walking through the streets of a town will not give 
a random selection of houses in the town. The words “random” and
“random sample” are, in fact, gravely abused. For this reason, if for no
other, the method of selecting the sample should be specified in all accounts
of the results of sample surveys and censuses, and indeed in all sampling work.

In order to prevent careless or deliberately biased selection on the part 
of investigators it is often important in large-scale work for the selection to 
be done in some central office, in such a manner that no element of choice is 
left to the investigators, and in such a manner also that checks on the field 
work can be imposed if necessary. Even in cases in which a less rigorous 
method of selection may be judged to be satisfactory, it may be necessary 
to impose a rigorous method in order to prevent criticism on this ground by 
those not familiar with the details of the work. 


2.4 Examples of biased selection 


It may be well at this stage to give some actual examples of cases in which 
an unsatisfactory method of selection has introduced serious bias into the 
results. 

The first example is taken from a paper by Kiser (1934, D). A sample of 
households was taken in Syracuse, U.S.A., in 1930 and 1931, with the object 
of making a study of morbidity. It was also intended to use this sample for 
the study of birth-rates. Before beginning this latter study, which was 
subsidiary to the morbidity study, a comparison was made of the sizes of 
households of the sample with those of the corresponding census tracts. This 
comparison is shown in Table 2.4.a. (Households of one were not included
in the survey.)

TABLE 2.4.a—SAMPLE OF HOUSEHOLDS IN SYRACUSE: DISTRIBUTION OF
HOUSEHOLDS ACCORDING TO SIZE, IN THE ORIGINAL SAMPLE, AND IN THE
CENSUS TRACTS


Number in        Original sample          Census tracts
household       Number   Per cent.      Number   Per cent.

2                  254      19.4         1,762      26.8
3                  338      25.9         1,745      26.5
4                  307      23.5         1,438      21.9
5                  201      15.4           853      13.0
6                  106       8.1           388       5.9
7                   46       3.5           208       3.2
8                   25       1.9            96       1.5
9 and over          29       2.2            86       1.3

Total            1,306      99.9         6,576     100.1


It is immediately apparent from the table that the sample contains a
considerably higher proportion of large households than exist in the whole
population. Households of two are under-represented in the sample to the
extent of 7.4 per cent. of all households, or 28 per cent. of the households of
this size. This deficiency is attributed by Kiser to the failure of enumerators 
to revisit missed households, in which childless married women working away 
from home are likely to predominate. In order to provide a more satisfactory 
sample it was necessary to make a further survey of those families that were 
missed altogether at the time of the morbidity survey. 

It is interesting to note that the sample was apparently considered 
satisfactory for the morbidity study, as is indicated by the statement that the 
workers “ had been primarily concerned with securing a sample representative 
of the area in regard to prevalence of sickness rather than size of household.” 
Actually such a biased sample can scarcely be regarded as wholly satisfactory 
for even a morbidity study, since sickness rates are likely to vary with the 
size and composition of the family. 

The second example is one obtained at Rothamsted in an experimental 
sampling of a collection of stones (Yates, 1936, b, H). The stones, a number of 
flints of varying sizes, some 1200 in all, were spread out on a table, and twelve 
observers were each instructed to choose three samples of twenty stones which 
should represent as nearly as possible the size distribution of the whole 
collection. Table 2.4.b gives the mean weights per stone of these 36 samples, 
and also the true mean weight of the whole collection. 


TABLE 2.4.b—MEAN WEIGHT PER STONE IN SAMPLES OF 20 STONES (oz.)


Observer     1    2    3    4    5    6    7    8    9   10   11   12

Sample 1   1.9  2.4  2.4  1.9  2.2  2.8  2.4  1.6  2.2  2.6  2.4  2.4
Sample 2   1.8  3.0  2.4  2.0  2.7  2.6  2.6  2.0  2.2  2.2  2.4  3.0
Sample 3   1.7  2.4  2.1  2.0  3.1  2.8  2.5  2.0  2.2  3.1  1.8  2.4

Mean       1.8  2.6  2.3  2.0  2.7  2.7  2.5  1.9  2.2  2.6  2.2  2.6

Mean of all samples: 2.34 oz. True mean: 1.91 oz.


It is apparent that there is a tendency, which is common to most observers, 
to select stones which are on the average larger than those of the whole 
collection. Of the twelve observers ten chose samples whose mean weight 
was above the mean weight, 1.91 oz., of all the stones, the mean for all samples
being 2.34 oz. This tendency is consistent from sample to sample. Thus,
of the thirty samples chosen by the above ten observers, all but two had mean 
weights greater than the mean weight of all stones, while all three samples of 
observer 1 were less than the correct mean. 
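
The statements of the preceding paragraph can be checked directly from the
figures of Table 2.4.b. The short sketch below is offered only as an
illustration. The sample means are those of the table, read with their
decimal points restored, and the variable names are hypothetical.

```python
TRUE_MEAN = 1.91  # oz., true mean weight of the whole collection

# Mean weight per stone (oz.) of the three samples chosen by each observer.
samples = {
    1: [1.9, 1.8, 1.7],   2: [2.4, 3.0, 2.4],   3: [2.4, 2.4, 2.1],
    4: [1.9, 2.0, 2.0],   5: [2.2, 2.7, 3.1],   6: [2.8, 2.6, 2.8],
    7: [2.4, 2.6, 2.5],   8: [1.6, 2.0, 2.0],   9: [2.2, 2.2, 2.2],
    10: [2.6, 2.2, 3.1],  11: [2.4, 2.4, 1.8],  12: [2.4, 3.0, 2.4],
}

observer_means = {obs: sum(w) / len(w) for obs, w in samples.items()}
all_means = [w for ws in samples.values() for w in ws]
overall_mean = sum(all_means) / len(all_means)

# Observers whose average choice was heavier than the true mean.
high_observers = [obs for obs, m in observer_means.items() if m > TRUE_MEAN]
# Samples of those observers that nevertheless fell at or below the true mean.
low_among_high = sum(
    1 for obs in high_observers for w in samples[obs] if w <= TRUE_MEAN
)

print(len(high_observers), low_among_high, round(overall_mean, 2))
print("bias (oz.):", round(overall_mean - TRUE_MEAN, 2))
```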

In this example the selection was deliberate. A further example showing 
similar effects arising from haphazard selection (claimed by the observer to
be “random”) is provided by some observations obtained in the course of
a scheme of sampling observations on the growth of wheat instituted by the 
Agricultural Meteorological Committee (Yates, 1935, A). 

In this scheme measurements on the heights of shoots of wheat were made 
at regular intervals on observation plots at a number of centres. A detailed 
procedure had been laid down for the random location on each occasion of 
128 quarter-metre lengths of row in sets of 4 on contiguous rows. The height 
measurements were made on the 256 shoots at the ends of these lengths—test 
observations conducted at another time indicated that this method of selection 
was virtually random. At one centre a drill with fewer rows than normal 
had to be used, and as a result only 192 shoots were available for measurement 
on each occasion. In order to provide the number of observations laid down 
and thereby, as he thought, improve his results, the observer selected “at
random” two additional shoots from each set of three quarter-metre
lengths. Fortunately he booked the observations on these additional shoots
separately.

Figures 2.4.a and 2.4.b show the distribution of the regular and additional
measurements taken on the 31st May and on the 28th June respectively. The
deviations from the set means of the regular measurements are shown.
Suitable adjustments, details of which are given in the original paper, are
made to the additional measurements to give fair representation of the
variability as well as the bias in the mean.

Examination of Figure 2.4.a indicates that on this date the additional 
measurements show a considerable preponderance of positive deviations with 
a corresponding deficiency of negative deviations. There is, in fact, a tendency 
to select shoots which are higher on the average than those of a truly random 
sample, the difference in the average height being +3.3 cm. This difference
is clearly in the nature of a bias, and cannot be attributed to random sampling
error. The situation was entirely different on the 28th June, as is shown in
Figure 2.4.b. At this date the deviations of the additional measurements, 
both positive and negative, are smaller on the average than are those of the 
regular observations ; in other words, there is a tendency to select shoots 
which are nearer the mean height than they would be on the average in a truly 
random sample. In spite of this, there is again a considerable bias, this time 
negative, the mean difference being -2.7 cm. In this case, therefore, a single
additional shoot will give a value which on the average is closer to the true
mean value than is the value given by a single randomly located shoot, but
as the number of shoots is increased the relative accuracy of the random sample
progressively increases, and with the numbers of shoots actually taken, the
random sample is considerably more accurate. 

This example provides an illustration of a case where the biases on the 
two occasions, though arising from similar defects in selection, are of very 
different magnitude, and indeed of opposite sign. Consequently the difference 
of the two sets of measurements will also be seriously affected by bias. In
this case the growth rate of the wheat would have been underestimated by
nearly 10 per cent. had only the additional measurements been available.
These biases are, of course, of the type that might be expected. When the
shoots are only half-grown and there is nothing much to be seen except the
top leaves there will be a tendency to pick the longer shoots, but when the
crop has come into ear the observer can see shoots of all lengths, and is more
likely to select shoots somewhere near the average, omitting both very long
and very short shoots. The strong negative bias of the last set of measurements
shows that this selection was not particularly effective in improving the
accuracy of the sample.

FIG. 2.4.a—DISTRIBUTION OF REGULAR OBSERVATIONS (shaded) AND ADDITIONAL
OBSERVATIONS (unshaded) OF HEIGHTS OF WHEAT SHOOTS ON 31st MAY
(By courtesy of the editor of the Annals of Eugenics.)

FIG. 2.4.b—DISTRIBUTION OF REGULAR OBSERVATIONS (shaded) AND ADDITIONAL
OBSERVATIONS (unshaded) OF HEIGHTS OF WHEAT SHOOTS ON 28th JUNE
(By courtesy of the editor of the Annals of Eugenics.)


2.5 Bias arising from faulty demarcation of the sampling units 


Any consistent errors in measurement will clearly give rise to bias, 
whether the measurements are carried out on a sample or on all the units of 
the population. The danger of such errors is, however, likely to be greater 
in sampling work, since the units measured are often smaller. Furthermore, 
the knowledge that had another sampling unit by chance been selected a very 
different value might have been obtained, may lead the inexperienced worker 
to believe that accuracy in the measurement of the selected units is of little 
importance.

When the sampling units are not natural units of the population, the selected 
units usually have to be demarcated at the time the measurements are taken. 
In crop sampling work in particular, where small areas are selected in order 
to obtain an estimate of the yield or other characteristics of the crop, location 
of the areas by means of randomly selected co-ordinates, though theoretically 
ensuring a random sample, will only in practice do so if the field work is carried 
out with complete objectivity. Since it is impossible in practice to locate the 
areas according to their co-ordinates by means of exact measurements, pacing 
or some similar approximate method must be used. 

In this type of work the areas themselves should not be too small, both 
because errors in the demarcation of the boundaries become of increasing 
importance as the size of the unit is decreased, and also because the possibility 
of influencing the results by small changes in location, e.g. so as to include 
a particularly good plant, is greater the smaller the unit areas. Very small 
areas are capable of giving completely reliable results with experienced and 
well trained field-men, but may be very unreliable when used by inexperienced 
workers, particularly if the need for complete objectivity is not appreciated. 

Sukhatme (1946 a, H), for example, has reported the biases shown in
Table 2.5 in some trial crop-sampling work on wheat. He himself expresses
the opinion that the biases of the very small areas are due to the inclusion of
border plants. This, however, would imply that the effective radius of the
smallest areas, which were nominally circles of 2 ft. radius, would have to be
increased by nearly 5 inches. Errors of this magnitude appear improbable,
unless the observers were very careless in their work, and it seems likely that
part at least of the bias has been caused by faulty location.
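
The figure of nearly 5 inches can be reproduced by a short calculation. The
sketch below assumes, purely for illustration, that the whole of the 42.4 per
cent. overestimate of the smallest areas arises from harvesting a circle
larger than the nominal one, the yield per unit area being uniform. The
variable names are hypothetical.

```python
import math

nominal_radius_ft = 2.0  # nominal circle of 2 ft. radius (about 12.6 sq. ft.)
overestimate = 0.424     # 42.4 per cent. overestimation, from Table 2.5

# If yield is proportional to the area harvested, the area is inflated by the
# factor (1 + overestimate) and the radius by the square root of that factor.
effective_radius_ft = nominal_radius_ft * math.sqrt(1 + overestimate)
extra_inches = (effective_radius_ft - nominal_radius_ft) * 12

print(round(extra_inches, 1), "inches added to the radius")
```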

Eye estimates are themselves a form of measurement, but such estimates 
are always subject to bias, which is likely to vary from observer to observer, 
and is often very substantial. If eye estimates are used, steps must therefore 
be taken to eliminate the resultant biases by carrying out proper measurements 
on a sub-sample of the material. A simple example of this is provided by the 
1938-9 Census of Woodlands described in Section 4.25 and Examples 6.12.b
and 7.11. A more complicated example is discussed in Sections 6.15 and 7.14.


TABLE 2.5—BIAS IN THE USE OF SMALL-SIZE AREAS IN SAMPLE SURVEYS FOR
YIELD (Sukhatme)


Size of area    No. of    Average yield in    Percentage
in sq. ft.      areas     maunds per acre     overestimation

Irrigated
471.5             78          10.10               —
117.9             78          10.58              4.8
 29.5             78          11.69             15.7
 28.3            117          11.60             14.9
 12.6            117          14.38             42.4

Unirrigated
471.5            107                              —
117.9            107                            11.0
 29.5            107                            23.4
 28.3            162                            14.8
 12.6            161                            42.4


2.6 Bias in estimation 


In addition to biases which arise from faulty processes of selection and 
faulty work during the collection of the information, faulty methods of 
analysing the results may also introduce bias. A simple example occurs in 
the estimation of ratios. If, for instance, an agricultural crop is grown on 
types of land with different levels of fertility, and if the fields on the different 
types of land are of different average size, the mean yield per acre estimated 
from the mean of the yields per acre of all the fields may be markedly different 
from the mean yield per acre of all the land growing the crop. To take a
numerical example, if there are three types of land having average yields of
20 cwt., 15 cwt. and 10 cwt. per acre respectively, and fields of an average
size of 5 acres, 10 acres and 15 acres respectively, the number of fields on each
type of land being the same, the mean yield per acre over the whole of the
land will be given by the weighted mean

$$\frac{5 \times 20 + 10 \times 15 + 15 \times 10}{5 + 10 + 15} = 13\tfrac{1}{3} \text{ cwt. per acre}$$

whereas the mean of the yields per acre of all fields will be 15 cwt. Consequently
the bias in the estimate by the latter process will be about 12 per cent.
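
The arithmetic of this example may be set out as a short calculation. The
sketch below uses only the figures given above. The variable names are
illustrative.

```python
yields_per_acre = [20.0, 15.0, 10.0]  # cwt. per acre on the three types of land
field_sizes = [5.0, 10.0, 15.0]       # average acres per field on each type

# Correct estimate: total yield divided by total area (weighted mean).
weighted_mean = (
    sum(y * a for y, a in zip(yields_per_acre, field_sizes)) / sum(field_sizes)
)
# Biased estimate: simple mean of the yields per acre of the fields.
unweighted_mean = sum(yields_per_acre) / len(yields_per_acre)

bias_per_cent = 100 * (unweighted_mean - weighted_mean) / weighted_mean
print(round(weighted_mean, 2), unweighted_mean, round(bias_per_cent, 1))
```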

Biased estimates can be avoided by using the proper processes of estimation. 
This matter will be dealt with more fully in Chapter 6. * 


2.7 Circumstances in which bias is permissible 


Although avoidance of any substantial bias is usually of the utmost 
importance, particularly in censuses on which administrative action has to be 
based, absence of bias is not always essential. In some types of investigation 
a certain amount of bias, provided it is reasonably constant, can be accepted. 
In censuses which are repeated at frequent intervals with a view to determining 
the changes rather than absolute values, for instance, a small overall bias may 
be of little consequence, provided it is constant in time. Similarly in surveys 
which have as their main objective the comparison of different groups of the 
population a bias which is approximately constant from group to group will 
be of little importance. The investigator must also avoid attaching exaggerated
importance to minor sources of bias which, in fact, can only produce errors
which are trivial relative to the random sampling error.


2.8 Methods of reducing the random sampling error 


Once the absence of any important bias has been ensured, attention can be
turned to the random sampling errors. These must clearly be sufficiently
small to give the accuracy required.

The simplest way of increasing the accuracy
of the sample is to increase its size. Other things being equal, the random
sampling error is approximately inversely proportional to the square root of
the number of units included in the sample.

The accuracy attained will, however, depend not only on the number of
units included in the sample, but also on the variability per unit; or, more
strictly, on that part of the variability per unit which contributes to the
sampling error. It is here that the complications of sampling procedure, both
of design and of subsequent analysis, arise. By suitable processes of selection,
which while imposing restrictions on fully random selection do not introduce
bias into the results, the part of the variability per unit which contributes to
the sampling error can often be substantially reduced, and the size of the
sample required for a given accuracy reduced accordingly.

* See also Sections 10.6 and 10.7.

The simplest type of restriction is that known as stratification. The
population is “stratified” or divided into blocks of units in such a manner
that the units in each stratum or block are as similar as possible. Each of
the strata is then sampled at random. If the same proportion is taken from
each stratum, it is clear that each stratum will be represented in the correct
proportion in the sample, and consequently differences between different strata
are eliminated from the sampling error.
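
The effect of stratification on the sampling error may be illustrated
numerically. The sketch below is not taken from the text: it relies on the
standard textbook expressions for the variance of the mean of a random sample
drawn without replacement, and of a stratified sample with the same sampling
fraction in every stratum. The population values and function names are
hypothetical.

```python
from statistics import variance

def var_random_mean(values, n):
    """Variance of the mean of a random sample of n (without replacement)."""
    big_n = len(values)
    return (1 - n / big_n) * variance(values) / n

def var_stratified_mean(strata, n):
    """Variance of the mean with the same sampling fraction in each stratum."""
    big_n = sum(len(s) for s in strata)
    # Only the variability within strata contributes to the sampling error.
    within = sum(len(s) / big_n * variance(s) for s in strata)
    return (1 - n / big_n) * within / n

# Two hypothetical strata, each fairly uniform but differing widely in level.
strata = [[10, 12, 11, 13, 9, 11], [30, 32, 29, 31, 33, 31]]
population = [v for s in strata for v in s]

print(var_random_mean(population, 4), var_stratified_mean(strata, 4))
```

With strata as dissimilar as these, almost the whole of the variability lies
between strata, and the stratified sample is correspondingly much more
accurate.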

In addition to stratification there are a number of other devices, which 
will be discussed in more detail later, by which the accuracy of the sampling 
procedure can be increased, often very substantially. The three most important 
are: utilization of supplementary information, use of a variable sampling 
fraction (sometimes called “ optimal allocation ”) and multi-stage sampling. 

Utilization of supplementary information, that is information which is derived 
from sources other than the sampling scheme, or from a more extended sample 
than that on which information on the main characters is collected, takes a 
number of forms. A simple example will illustrate the general principle. 
Suppose that an estimate of the wheat yield of a country is required, and that 
a random sample of wheat fields has been taken and the total yield of each 
field determined. We can then estimate the total wheat yield of the country, 
either (a) by multiplying the total yield of the sample by the reciprocal of the 
proportion of the fields included in the sample, or (b) by calculating the mean
yield per acre of the sampled fields (by dividing the total yield of all the sampled 
fields by their total area, so as to avoid bias) and multiplying this mean yield 
by the total acreage of wheat in the country. The latter estimate can only be 
made if the total acreage of wheat in the country is already known with sufficient 
accuracy, e.g. from returns made by the farmers or from a larger sample. 
If this information is available the second estimate is likely to be considerably 
more precise than is the first, since the variability of the total yields, which 
in so far as the yield per acre is constant will be proportional to the areas of 
the individual fields, is likely to be considerably greater than the variability 
of the yields per acre of the individual fields. 
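
The two estimates (a) and (b) may be compared on a small invented example.
All the figures below are hypothetical and serve only to show the arithmetic:
a sample of three fields from a country assumed to contain 300 wheat fields
with a known total of 2,100 acres of wheat.

```python
# Each sampled field: (area in acres, total yield in cwt.). Invented figures.
sample_fields = [(4.0, 60.0), (10.0, 140.0), (6.0, 96.0)]
fields_in_country = 300          # assumed known number of wheat fields
wheat_acres_in_country = 2100.0  # assumed known from supplementary information

sample_yield = sum(y for _, y in sample_fields)
sample_acres = sum(a for a, _ in sample_fields)

# (a) Sample total multiplied by the reciprocal of the proportion sampled.
estimate_a = sample_yield * fields_in_country / len(sample_fields)

# (b) Mean yield per acre (total yield / total area) times the known acreage.
estimate_b = (sample_yield / sample_acres) * wheat_acres_in_country

print(round(estimate_a), round(estimate_b))
```

Estimate (b) depends on the sampled fields only through their yield per acre,
which is why it is little affected by the accident of drawing large or small
fields.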

The use of a variable sampling fraction, i.e. the inclusion of different 
proportions of the different strata in the sample, enables the more important, 
or more variable, parts of the population to be sampled more intensively. 
If this is done it will of course be necessary to weight the contributions of the 
different strata to the total in the correct proportions.

The optimal sampling fractions depend on the relative variability of the 
different strata into which the population is divided for the purpose of taking 
a sample. Thus, if it is required to determine the number of workers in a 
given industry, it will be better to take a much larger fraction, possibly all, 
of the large factories than of the smaller factories. 
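
One common rule for choosing the sampling fractions, which is assumed here
and is not stated in the text, makes the number of units taken from each
stratum proportional to the number of units in the stratum multiplied by its
standard deviation. The sketch below applies this rule to three hypothetical
strata of factories. All figures and names are invented.

```python
def allocate(total_sample, stratum_sizes, stratum_sds):
    """Share a sample among strata in proportion to size times variability."""
    weights = [size * sd for size, sd in zip(stratum_sizes, stratum_sds)]
    total_weight = sum(weights)
    return [total_sample * w / total_weight for w in weights]

# Hypothetical strata: large, medium and small factories.
sizes = [50, 500, 5000]   # number of factories in each stratum
sds = [200.0, 60.0, 8.0]  # standard deviation of workers per factory

numbers = allocate(300, sizes, sds)
fractions = [n / size for n, size in zip(numbers, sizes)]
print(numbers, fractions)
```

If the rule calls for more units than a stratum contains, that stratum is
simply taken in full; the sketch does not deal with this case.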

In multi-stage sampling the population is divided into a number of first-stage
sampling units, which are sampled in the ordinary manner, the selected
first-stage units being subdivided into smaller second-stage units, which are
also sampled. Further stages can be added if required. Thus, for example,
in a population survey, a sample of all towns and villages may be taken, and
in each of the selected towns and villages a sub-sample of all households may
be taken, with, possibly, for certain purposes, a further sub-sample of individuals
from the selected households.
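
A two-stage selection of this kind might be sketched as follows. The example
is illustrative only: the towns, the households and the function name are
invented, and a pseudo-random number generator is used in place of a table of
random numbers.

```python
import random

def two_stage_sample(households_by_town, n_towns, n_households, rng):
    """Select towns at random, then households at random within each town."""
    towns = rng.sample(sorted(households_by_town), n_towns)  # first stage
    return {
        town: rng.sample(
            households_by_town[town],
            min(n_households, len(households_by_town[town])),
        )  # second stage, within the selected town only
        for town in towns
    }

# Hypothetical frame: ten towns of differing size.
frame = {
    f"town_{i}": [f"town_{i}/house_{j}" for j in range(20 + 5 * i)]
    for i in range(10)
}
selected = two_stage_sample(frame, n_towns=3, n_households=4,
                            rng=random.Random(1))
for town, houses in selected.items():
    print(town, houses)
```

Only the households of the selected towns need be listed, which is the
practical attraction of the method.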


2.9 Choice of unit 


In some classes of material there is considerable choice in the type and size 
of sampling units, and this gives further scope for increase in the efficiency 
of the sampling procedure. In general, when a given proportion of the material 
is included in the sample, the smaller the sampling units employed, the more 
accurate and representative will be the results. Thus, for example, in an 
agricultural survey, it will be more accurate to take 10 per cent. of all farms in 
each parish, or other small administrative unit, than to take all the farms in 
10 per cent. of the parishes. This will remain true even if multi-stage 
sampling is adopted. It will be more accurate, for example, to take 10 per cent. 
of all the parishes in each county, with a second-stage sampling of the farms
of each selected parish, rather than to take all the parishes in 10 per cent. of
the counties, with the same degree of second-stage sampling. The reason
for this is fairly obvious. The parishes in any county are likely to be more alike
than are those of different counties, and if counties are used as sampling units,
all the parishes in a county will be included or excluded from the sample
together.

The use of small units distributed over the whole of the population
often conflicts with the administrative requirements. It is clearly easier to
arrange for a survey of farms in compact areas, such as parishes or counties,
than to have to survey the same number of farms scattered over the whole
country. The choice of a suitable balance between these two conflicting
requirements is often one of the main problems in the planning of a sample
survey. Furthermore, if only a small number of large units are included in
the sample, whether or not there is second-stage sampling of these units, the
sampling error will not be well determined, since there will be relatively few
differences between units on which to base the estimate of error.

We see, therefore, that the choice of sampling method depends not only
on the relative accuracy of the different methods, but also on their practical
convenience. It is important, for example, that the process of selection of the
sample should not involve excessive [...], etc. The most suitable sampling
method will also depend very much
on the type of information that is already available on the population to be
sampled. A method which may be excellent in a country for which maps
are available may be entirely useless in a country which is inadequately mapped.
Again, it is important that not only should the field work
not involve excessive travelling, but also it should be possible to subject the
field-workers to proper supervision; consequently, sampling procedures which
may be excellent with postal questionnaires may be very unsatisfactory
when the information is collected by special investigators.


CHAPTER 3
THE STRUCTURE OF VARIOUS TYPES OF SAMPLE 


3.1 Definition of frame and sampling unit 


In this chapter we propose to give a technical description of the structure 
of the various types of sample which are most commonly employed in practice, 
and the methods which must be followed in selecting them. The methods of 
obtaining estimates of the population values and of the sampling errors from 
the sample values will be discussed in Chapters 6 and 7. 


relation to the natural subdivisions of the material, 

It is not always necessary to make an actual subdivision of the whole of 
the material before selection of the sample, provided the selected units can be 
clearly and unambiguously defined. Thus, with sampling units which are 
rectangular areas on a map there is no need to demarcate all these areas ; they 
can be defined by co-ordinates, and the selected areas demarcated after selection. 

Clear and unambiguous definition demands the existence or construction 
of some form of frame. In the sampling of a human population, for instance, 
with households as sampling units, there must be available a list of all house- 
holds, and this list must be such that any household selected from it can be 
unambiguously located. In area sampling from maps, the maps must be such 
that the selected areas can be unambiguously defined on the ground.

The specification of the frame implicitly defines the geographical scope of the 
survey and the categories of material covered. A survey of a human population 
based on a list of households, for instance, will only cover those categories 
of the population which constitute the households included in the list. If 
other categories require inclusion, or if the frame is defective, special steps 
will have to be taken to supplement and emend it. 

In statistical terminology any aggregate of values is termed a population, 
and consequently the whole aggregate of sampling units into which the material 
is divided is known as the population of sampling units. If the sampling units 
are aggregates of the natural units of the material, these natural units will form 
a further population which must be distinguished from the population of 
sampling units. 

In America the term cluster sampling has been applied to sampling in which 
the sampling units are aggregates or “ clusters ” of the natural units. The 
term is a somewhat loose one, since there is often a hierarchy of natural units, 
e.g. a sample in which the sampling units are households may be regarded as 
an ordinary sample of households or as a cluster sample of individuals.

In multi-stage sampling there is also a hierarchy of sampling units, first-stage,
second-stage, etc., corresponding to the different sampling stages, and
each set of units will form its own population of units. 

Sampling units may be of the same or differing size. They may contain 
the same, or approximately the same, number of natural units, or they may 
contain widely differing numbers. The whole procedure of sampling, including 
the estimation of the population values and the sampling errors, is simplest 
when the sampling units are of approximately the same size and contain 
approximately the same number of natural units. Often, however, the material 
is such that this condition cannot be conveniently fulfilled. In particular, if 
the natural units are themselves of widely differing size, variation in size of 
the sampling units or in the number of natural units they contain is inevitable. 

There is nothing in the sampling process which demands that the sampling 
units should be of any particular size, but, as has been explained in Section 2.9, 
the smaller the sampling units employed the more accurate will be the 
results obtained when a given proportion of the material is included in the
sample.


3.2 Random sample 

A random sample is the simplest type of rigorously selected sample, and 
is the basis of most of the more complicated sampling methods. In a random
sample, after subdivision of the material into sampling units, the requisite 
number of units are selected at random from the whole population of units. 

As has been emphasized in Section 2.3, random selection implies a strict 
process of selection equivalent to that of drawing lots. In practice it may 
be carried out either by some such process, or preferably, since adequate
shuffling of cards, etc., is difficult, by the use of a table of random numbers. 
A small table of random numbers is given at the end of the book. The examples 
of this section illustrate the use of such a table. 

The process of random selection may proceed in two stages. Suppose
that the population is divided into groups of units containing
$x_1, x_2, x_3, \ldots, x_n$ units. The successive sub-totals

$$X_1 = x_1, \quad X_2 = x_1 + x_2, \quad X_3 = x_1 + x_2 + x_3, \quad \ldots, \quad X_n = x_1 + x_2 + \cdots + x_n$$

are formed, which is easily done on a printing adding machine. The
requisite numbers are then selected at random between 1 and $X_n$,
numbers that occur more than once being rejected. A selected number that
is greater than $X_{s-1}$, but less than or equal to $X_s$, indicates that a unit of the
$s$th group is to be selected. Selection of a unit at random from this group, which
can if convenient be made on the basis of the number already selected, will
then give the equivalent of completely random selection.

This two-stage process is of value when the full numbering or demarcation
of the units in all groups before sampling is laborious, since only the total
number of units in each group need be known. It is of particular value when
the units are artificially demarcated areas, and the total areas of natural
subdivisions of the material are known. By using the process only the units
in the selected groups have to be numbered or demarcated.
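
The two-stage process may be sketched in a few lines. The example is
illustrative only: the group sizes are invented, groups are numbered from 0 as
is usual in the language, a pseudo-random generator replaces the table of
random numbers, and the unit within the group is taken as the excess of the
selected number over the preceding sub-total, which is one way of basing the
second selection on the number already selected.

```python
import random
from bisect import bisect_left
from itertools import accumulate

def locate(number, subtotals):
    """Return (group index, unit within group) for a selected number.

    The group s is the one for which X_(s-1) < number <= X_s.
    """
    s = bisect_left(subtotals, number)
    previous = subtotals[s - 1] if s > 0 else 0
    return s, number - previous

group_sizes = [3, 5, 2]                    # hypothetical numbers of units
subtotals = list(accumulate(group_sizes))  # successive sub-totals: 3, 8, 10

rng = random.Random(0)
# Sampling without replacement is equivalent to rejecting repeated numbers.
chosen = rng.sample(range(1, subtotals[-1] + 1), 4)
picks = [locate(number, subtotals) for number in chosen]
print(chosen, picks)
```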


Example 3.2.a 
Select a sample of 20 from a population of 2879 units.


Using the four-figure numbers given by the first four columns of digits 
in Table A. 1, and rejecting all numbers greater than 2879, we obtain the 


1490, 2046, 526, 797, 2699, 1465, 2467, 1753. 

The above procedure results in the rejection, in this example, of nearly 
three-quarters of the random numbers given by the table. Various devices 
may be used to avoid this. In the present example the simplest is to take 
the numbers 3001-6000 and 6001-9000 as equivalent to 0001-3000, rejecting 
the numbers 9001-9999 and 0000. Using the second column of four-figure 
numbers gives the sample 1373, 2467, 227, 2599, 2635, 1794, 1753, 378, 1234, 
2632, 792, 897, 1064, 2819, 1712, 1837, 2722, 1504, 13, 2565. 

If with either of the above procedures the same unit is selected a second 
time, the number leading to this selection is rejected, and an additional 
number taken. 
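The two procedures of this example can be sketched in Python. This is an illustration only: the function names are hypothetical, and Python's pseudo-random generator stands in for Table A.1, so the numbers drawn will not be those quoted above.

```python
import random

def fold_four_figure(number, block=3000, population=2879):
    """Map a four-figure random number (0-9999) to a unit 1..population.

    3001-6000 and 6001-9000 are treated as equivalent to 0001-3000;
    0000, 9001-9999 and folded values above `population` are rejected
    (None is returned).
    """
    if number == 0 or number > 3 * block:
        return None
    unit = (number - 1) % block + 1
    return unit if unit <= population else None

def select_sample(n=20, population=2879, seed=None):
    """Draw n distinct units; a repeated unit is rejected and another number taken."""
    rng = random.Random(seed)          # stand-in for the printed table
    chosen = []
    while len(chosen) < n:
        unit = fold_four_figure(rng.randint(0, 9999), population=population)
        if unit is not None and unit not in chosen:
            chosen.append(unit)
    return chosen
```

With this device about 3 x 2879 out of every 10,000 four-figure numbers are usable, against 2879 out of 10,000 when every number above 2879 is simply rejected.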

It will be noted that neither of these samples is evenly distributed over 
the whole range of units. The distribution between the different thirds of 
the range is in fact: 

Numbers        1st sample    2nd sample 
1-960               4             5 
961-1920           10             8 
1921-2879           6             7 
Total              20            20 


Random selection will give samples that deviate somewhat from an even 
distribution, the actual deviations being themselves governed by statistical 
laws. Exact statistical tests show that about three out of four samples will have 
smaller aggregate deviations than the first sample, but only three out of ten 
will have smaller aggregate deviations than the second sample.* 
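The two proportions quoted can be checked with a short calculation. The sketch below assumes that the test meant is the usual chi-square test on the three thirds of the range (2 degrees of freedom), for which the probability of a smaller value is 1 - exp(-chi2/2); the function names are illustrative.

```python
import math

def chi_square(observed, expected):
    """Sum of (observed - expected)^2 / expected over the classes."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

def prob_smaller_2df(chi2):
    """P(chi-square with 2 degrees of freedom < chi2) = 1 - exp(-chi2 / 2)."""
    return 1.0 - math.exp(-chi2 / 2.0)

sizes = [960, 960, 959]                      # units in 1-960, 961-1920, 1921-2879
expected = [20 * s / 2879 for s in sizes]    # expected numbers in a sample of 20

p_first = prob_smaller_2df(chi_square([4, 10, 6], expected))
p_second = prob_smaller_2df(chi_square([5, 8, 7], expected))
print(round(p_first, 2), round(p_second, 2))
```

The two probabilities come out close to three in four and three in ten respectively.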


Example 3.2.b 


Select unit areas 1/10 mile x 1/10 mile at random from a rectangular area 
5 miles x 4 miles. 
There are 2000 unit areas, which can best be defined by co-ordinates 1-50 
along the longer side of the rectangle, and 1-40 along the shorter side, the 
co-ordinates selected defining the corner of the unit area furthest from the 
corner of the rectangle (0, 0). The selection of a number at random between 
1 and 50, and a second between 1 and 40, will therefore select a unit at random. 
Taking the third column of four-figure numbers (beginning 8636) and following 
the second of the procedures of Example 3.2.a gives the pairs of co-ordinates 
36, 36; 12, 02; 16, 16; 14, 38; etc. 

* The appropriate test is that known as the χ² test. A description of this test 
will be found in most modern statistical textbooks. 

If points instead of unit areas are to be selected each co-ordinate range 
should theoretically be infinitely subdivided. The actual degree of subdivision 
need not usually be very fine. 

The procedure of this example may be used for the selection of unit areas 
or points from an irregularly shaped area, provided the extreme range of 
each co-ordinate is included, points falling outside the area being rejected. 
More elaborate processes, involving less rejection, can of course be devised 
but care must be taken that the probability of selection of all areas or points 
is equal. Thus in a triangular area, the selection of lines parallel to the base 
at random distances from the base, followed by the selection at random of 
a point within the triangle on each of the selected lines, will give a greater 
density of points near the apex of the triangle. The selection of points within 
a circle by the selection of random distances and bearings from the centre 
will give a greater density of points near the centre. In irregularly shaped 
areas, also, fractional unit areas requiring special treatment will occur at the 
boundaries. 
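The warning about circles can be illustrated numerically. The sketch below is an assumption-laden illustration, not part of the original text: it compares rejection from the enclosing square with selection by random distance and bearing, counting the share of points that fall within half the radius of the centre (one quarter of the area).

```python
import math
import random

def point_by_rejection(rng, radius=1.0):
    """Uniform point in a circle: draw from the enclosing square, reject points outside."""
    while True:
        x = rng.uniform(-radius, radius)
        y = rng.uniform(-radius, radius)
        if x * x + y * y <= radius * radius:
            return x, y

def point_by_distance_and_bearing(rng, radius=1.0):
    """Random distance and bearing from the centre: NOT uniform over the area."""
    r = rng.uniform(0.0, radius)
    theta = rng.uniform(0.0, 2.0 * math.pi)
    return r * math.cos(theta), r * math.sin(theta)

def share_in_inner_half(draw, n=20000, seed=0):
    """Proportion of n points lying within half the radius of the centre."""
    rng = random.Random(seed)
    inside = sum(1 for _ in range(n) if math.hypot(*draw(rng)) <= 0.5)
    return inside / n

print(share_in_inner_half(point_by_rejection))             # close to 1/4
print(share_in_inner_half(point_by_distance_and_bearing))  # close to 1/2
```

The second method places about twice as many points in the inner quarter of the area as equal probability would allow, which is the excess density near the centre described above.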


Example 3.2.c 


14 streets in a ward contain 25, 17, 5, 59, 64, 22, 38, 16, 21, 12, 14, 38, 
17, 23 houses respectively. Make a random selection of 6 houses from all 
371 houses. 

The successive sub-totals are 25, 42, 47, 106, 170, 192, 230, 246, 267, 279, 
293, 331, 348, 371. A table of random numbers gives the numbers 72, 128, 
96, 326, 199, 202. The units 72 and 96 therefore fall in the 4th street, the 
unit 128 in the 5th street, the units 199 and 202 in the 7th street, and the unit 
326 in the 12th street. Since 72 - 47 = 25, and 96 - 47 = 49, the 25th 
and 49th houses in the 4th street are selected, etc. The numbering of four 
streets, involving 199 houses, is required. 
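The look-up against the successive sub-totals can be written compactly. The following Python sketch uses the figures of this example; the function name `locate` is illustrative.

```python
from bisect import bisect_left
from itertools import accumulate

houses = [25, 17, 5, 59, 64, 22, 38, 16, 21, 12, 14, 38, 17, 23]
sub_totals = list(accumulate(houses))      # 25, 42, 47, 106, ..., 371

def locate(number):
    """Return (street, house within street), both counted from 1."""
    street = bisect_left(sub_totals, number)           # first sub-total >= number
    previous = sub_totals[street - 1] if street else 0
    return street + 1, number - previous

selected = [locate(n) for n in (72, 128, 96, 326, 199, 202)]
streets = sorted({street for street, _ in selected})
to_number = sum(houses[s - 1] for s in streets)        # houses needing numbering
print(selected, streets, to_number)
```

Only the streets that actually receive a selected number need to have their houses numbered.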


3.3 Stratification with uniform sampling fraction 


In a stratified sample the population of sampling units is subdivided into 
groups or "strata" before selection of the sample. These strata may all 
contain the same number of units, or differing numbers of units. If a uniform 
sampling fraction is used, the same fraction of the units of each stratum is 
included in the sample, the units selected being chosen at random from all 
the units within each stratum. A stratified sample is thus equivalent to a set 
of random samples on a number of sub-populations, each equivalent to one 
stratum. 

Stratification has two purposes. The first is to increase the accuracy of 
the overall population estimates. The second is to ensure that subdivisions 
of the population which are themselves of interest are adequately represented. 
Such subdivisions may be termed domains of study. Maximum overall accuracy 
will be attained if the strata are so chosen that the units within each stratum 
are as similar as possible. It will often be advisable to use domains of study 
as strata, however, even if some other form of stratification might be expected 
to give somewhat more accurate results. If there is marked heterogeneity 


Stratification affects the estimation of the sampling error. Since in a 


only be done from differences between units in the same stratum. It is 
therefore necessary, if an estimate of sampling error is required, that the strata 
be of such size that the sample contains two or more units from at least the 
majority of strata. In certain cases, in which the use of strata containing only 


If the sampling units are already classified in the 
selection of a stratified sample can be made in the same way 


If, however, the population is not so classified, selection by this method would 
necessitate prior classification. In this case, if the numbers of units in the 

This consists 
of selecting a sample at random; keeping a tally, as the selection proceeds, 
of the numbers falling in each stratum; and rejecting any further members 
of a particular stratum as soon as the requisite number for that stratum has 
been obtained. On the other hand, if the numbers of units in the different 
strata are not known, a count covering the whole population will in any case 
have to be made, in which case a classification which will serve as a basis for 
the subsequent selection of the sample may well be carri 


taken. We may thus differentiate between the working sampling fraction, 
which with stratification with a uniform sampling fraction is the same for all 
strata, and the exact sampling fractions, which will differ slightly from the 
working sampling fraction. The use of the working sampling fraction in the 
analysis of the results leads to minor inaccuracies, but these will seldom give 
rise to errors of any practical importance. 
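The tally method described above (random selection, with further members of a stratum rejected once its requisite number is reached) might be sketched as follows. The function name, the dictionary layout, and the use of a shuffled order to represent random drawing are assumptions made for the illustration.

```python
import random

def tally_selection(stratum_of, quotas, seed=None):
    """Select units at random, keeping a tally of the numbers in each stratum.

    stratum_of: dict unit -> stratum label (consulted only as each unit is drawn).
    quotas:     dict stratum label -> requisite number of units.
    """
    rng = random.Random(seed)
    order = list(stratum_of)
    rng.shuffle(order)                    # a random order of drawing
    tally = {s: 0 for s in quotas}
    sample = []
    for unit in order:
        s = stratum_of[unit]
        if tally[s] < quotas[s]:          # requisite number not yet obtained
            tally[s] += 1
            sample.append(unit)
        if tally == quotas:               # all strata complete
            break
    return sample

# Hypothetical population: 120 units in stratum A, 80 in B, working fraction 1 in 10.
population = {i: ("A" if i < 120 else "B") for i in range(200)}
sample = tally_selection(population, {"A": 12, "B": 8}, seed=5)
```

Only the units actually drawn need to be classified, which is the point of the method when no prior classification exists.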
It may be noted that if the numbers of units from the whole population 
falling in different strata are known, and a random sample is taken which is 
sufficiently large to ensure that adequate numbers of units are obtained from 
all strata, adjustment of the results so that the different strata are represented 
in their correct proportions will lead to practically the same accuracy as would 
be obtained with a stratified sample. Which of these two alternative courses 
is adopted in any particular case is a question of convenience. If the selection 
of either type of sample is equally simple it is best to use a stratified sample, 
as the computations are thereby simplified. In certain cases, however, the 
classification of units into strata may only be possible by means of information 
obtained in the course of the survey, in which case a random sample, with 
subsequent adjustment, is required. Thus, for example, in a survey of a human 
population, the age distribution of the whole population may be known, but 
prior selection of individuals of particular ages may be impossible owing to 
lack of information on these ages. 
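The adjustment of a random sample so that the strata are represented in their correct proportions can be sketched as a weighted mean. The data below are invented purely to show the arithmetic, and the function name is illustrative.

```python
def adjusted_mean(sample, population_counts):
    """Weight the sample mean of each stratum by its known population number.

    sample:            list of (stratum, value) pairs from a random sample.
    population_counts: dict stratum -> number of units in the whole population.
    """
    totals, counts = {}, {}
    for stratum, value in sample:
        totals[stratum] = totals.get(stratum, 0.0) + value
        counts[stratum] = counts.get(stratum, 0) + 1
    n_population = sum(population_counts.values())
    return sum(
        population_counts[s] * totals[s] / counts[s] for s in population_counts
    ) / n_population

# Hypothetical figures: two age-groups of equal size in the population,
# unequally represented in the random sample.
sample = [("young", 10.0), ("young", 14.0), ("old", 30.0)]
print(adjusted_mean(sample, {"young": 50, "old": 50}))   # 21.0, against a crude mean of 18.0
```

Every stratum must be represented in the sample for the adjustment to be possible, which is why the text requires the sample to be sufficiently large.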


3.4 Multiple stratification 

A population may be stratified for two or more different characteristics. 
If selection is made from sub-strata formed of the various combinations of the 
main classifications the procedure is exactly equivalent to ordinary stratification, 
the sub-strata being equivalent to strata. Thus we may stratify farms 
according to size and according to geographical regions. If the farms in each 
region are classified into size-groups before taking the sample then the 
combinations of region and size-group form the individual sub-strata. 

Occasionally the number of units of the population falling in each set of 
main strata may be known, e.g. from prior census data, but not the numbers in 
the various sub-strata. Thus, in the above example there may be information 
on the numbers of farms in the different size-groups, and also on the numbers 
in the different geographical regions, but not on the numbers of each size-group 
in each region. In such cases we may attempt the selection of a sample which 
will have the right proportions for each set of main strata. Such stratification 
may be termed multiple stratification without control of sub-strata. The selection 
of such a sample, however, presents both theoretical and practical difficulties, 
and the calculation of the sampling error is also troublesome. 

In the rare cases in which multiple stratification without control of sub-strata 
is deemed to be necessary a simple procedure of selection which should give 
a reasonably satisfactory sample is as follows. Units are selected at random 
until the total of every row and column of the two or more-way table for the 
sets of strata is at least equal to the required total. The excesses of these 
marginal totals are calculated, and numbers chosen for deduction from the 
sub-strata totals which together make up these excesses, and which, subject 
to these restrictions, are about proportional to the sub-strata totals. (A method 
of calculating such numbers is shown in the following example.) The 
corresponding numbers of units are then rejected from the sub-strata groups, 
those rejected being selected at random. If the original selection was strictly 
random the condition of randomness will be fulfilled if those last selected 
are rejected. 


Example 3.4 


After a sample of 1125 units had been drawn, the numbers of units in the 
16 sub-strata shown in Table 3.4.a were obtained. 


TABLE 3.4.a. TWO-WAY STRATIFICATION WITHOUT CONTROL OF 
SUB-STRATA: INITIAL SAMPLE 


                     Strata A 
Strata B      1     2     3     4     Total   Required   Excess 
1            37    40    35     8      120      120         0 
2            39   140    82    56      317      280      + 37 
3            45    97   173    93      408      350      + 58 
4             8    40    86   146      280      250      + 30 
Total       129   317   376   303     1125     1000       125 
Required    120   280   350   250     1000 
Excess      + 9  + 37  + 26  + 53      125 
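The marginal totals and excesses of Table 3.4.a can be reproduced as below. The proportional distribution function shows only the unrounded first step; the whole-number values and the later empirical adjustments used in the worked example are matters of judgment and are not reproduced by this sketch. All names are illustrative.

```python
sample = [
    [37, 40, 35, 8],
    [39, 140, 82, 56],
    [45, 97, 173, 93],
    [8, 40, 86, 146],
]                                          # rows: strata B, columns: strata A
required_rows = [120, 280, 350, 250]
required_cols = [120, 280, 350, 250]

row_totals = [sum(row) for row in sample]
col_totals = [sum(col) for col in zip(*sample)]
row_excess = [t - r for t, r in zip(row_totals, required_rows)]
col_excess = [t - r for t, r in zip(col_totals, required_cols)]

def distribute_row_excesses(table, excesses):
    """Share each row excess in proportion to the numbers in that row (unrounded)."""
    return [[e * x / sum(row) for x in row] for row, e in zip(table, excesses)]

stage_1 = distribute_row_excesses(sample, row_excess)
print(row_excess, col_excess)
```

The unrounded shares for the second row, for instance, are about 4.6, 16.3, 9.6 and 6.5, which must then be rounded so as to total 37.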


The three stages of the calculation are shown in Table 3.4.b. In stage 1 
the excesses of the rows have been distributed in proportion to the numbers 
of units in each row. The distributed excesses are added by columns and 
compared with the required excesses. In stage 2 the differences, with 
signs reversed, are distributed by columns in proportion, and in stage 3 
the process is repeated for rows. Empirical adjustments, shown in brackets, 
have been chosen to make the column totals zero. 


TABLE 3.4.b 

Stage 1 
               A1     A2     A3     A4    Total 
B1              0      0      0      0       0 
B2              5     16      9      7      37 
B3              6     14     25     13      58 
B4              1      4      9     16      30 
Total          12     34     43     36     125 
Required        9     37     26     53     125 
Difference     +3     -3    +17    -17       0 

Stage 2 
               A1     A2     A3     A4    Total 
B1              0      0      0      0       0 
B2             -1     +2     -4     +3       0 
B3             -2     +1     -9     +5      -5 
B4              0      0     -4     +9      +5 
Total          -3     +3    -17    +17       0 

Stage 3 
               A1       A2     A3         A4         Total 
B1              0        0      0          0           0 
B2              0        0      0          0           0 
B3             +1 (0)   +1     +2         +1 (+2)     +5 
B4              0       -1     -1 (-2)    -3 (-2)     -5 
Total          +1 (0)    0     +1 (0)     -2 (0)       0 


TABLE 3.4.c. NUMBERS OF UNITS REJECTED, NUMBERS IN FINAL SAMPLE AND 
CORRECT SUB-STRATA TOTALS 

Numbers of units rejected 
               A1     A2     A3     A4    Total 
B1              0      0      0      0       0 
B2              4     18      5     10      37 
B3              4     16     18     20      58 
B4              1      3      3     23      30 
Total           9     37     26     53     125 

Numbers in final sample 
               A1     A2     A3     A4    Total 
B1             37     40     35      8     120 
B2             35    122     77     46     280 
B3             41     81    155     73     350 
B4              7     37     83    123     250 
Total         120    280    350    250    1000 

Correct sub-strata totals 
               A1     A2     A3     A4    Total 
B1             40     40     30     10     120 
B2             40    120     80     40     280 
B3             30     80    160     80     350 
B4             10     40     80    120     250 
Total         120    280    350    250    1000 
The three stages are then summed and deducted from the numbers in the 
original sample, as shown in Table 3.4.c. This table also shows the correct 
proportional sub-strata totals of the population from which the sample was 
drawn. The appropriate statistical test* shows that a satisfactory sample has 
been obtained. 

The above process does not necessarily converge, but will usually do so 
in practical cases. If a negative value for number of units rejected is obtained 


if different sampling fractions are used for the different strata. The greatest 
accuracy for a given number of units will be attained if the sampling fractions 
are proportional to the within-strata standard deviations† of the units. If the 
sampling fractions are denoted by $f_1, f_2, \ldots$ and the standard deviations by 
$\sigma_1, \sigma_2, \ldots$ we have 

$$\frac{f_1}{\sigma_1} = \frac{f_2}{\sigma_2} = \cdots$$

In some cases this formula may give sampling fractions greater than unity for 
some of the strata. If this occurs the whole of these strata are included in the 
sample.‡ 
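The rule can be turned into a small calculation. The sketch below assumes that the total number of units to be sampled is fixed in advance and that strata whose fraction would exceed unity are taken in full, the remaining fractions being recomputed; the function name and the figures are hypothetical.

```python
def sampling_fractions(sizes, sds, n_sample):
    """Sampling fractions proportional to within-strata standard deviations.

    sizes: number of units in each stratum; sds: standard deviations (all > 0);
    n_sample: total number of units wanted (less than the population).
    """
    fractions = [None] * len(sizes)
    remaining = n_sample
    open_strata = list(range(len(sizes)))
    while open_strata:
        # constant of proportionality for the strata not yet taken in full
        k = remaining / sum(sizes[h] * sds[h] for h in open_strata)
        full = [h for h in open_strata if k * sds[h] >= 1.0]
        if not full:
            for h in open_strata:
                fractions[h] = k * sds[h]
            break
        for h in full:                    # fraction would exceed unity
            fractions[h] = 1.0
            remaining -= sizes[h]
            open_strata.remove(h)
    return fractions

# Hypothetical strata: many small units, fewer medium, a handful of very large.
f = sampling_fractions([1000, 300, 20], [1.0, 4.0, 40.0], n_sample=100)
print(f)
```

In this invented case the third stratum is included in full, and the other two fractions remain in the ratio 1 to 4 of their standard deviations.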

A particularly important application of the variable sampling fraction is 
to material stratified into size-groups. In such material the various quantitative 


different size-groups. In this case the sampling fractions should be taken about 
proportional to these mean sizes. If quantitative characters 
correlated with size of unit are under investigation, the ranges of the 
size-groups may give good estimates of the relative within-strata standard 
deviations. The sampling fractions may then be taken proportional to these 
ranges. Changes with time, however, are usually by no means so highly 
correlated with size, and when the changes are of interest, sampling fractions 
proportional to the mean sizes of the size-groups will usually be best. 

The above rules will determine sampling fractions which give the maximum 
accuracy for estimates of the population values. In cases in which the values 
for the individual strata are of interest, i.e. cases in which the strata themselves 
form domains of study, it is also important to see that all the strata are adequately 
represented in the sample, and for this reason the rule of strict proportionality 
to the standard deviations, or to the mean sizes of the size-groups, often requires 
some modification.* 

* χ² = 9.1, 9 degrees of freedom. 

† The meaning of this term and the method of estimation will be explained in 
Chapter 7. 

‡ When variations in cost have to be taken into account formula 8.17.b is 
appropriate. 

When several quantities are under investigation, it will usually be found 
that their within-strata standard deviations are not in quite the same proportions. 
This, however, is not a very serious problem in practice, since any sampling 
fractions which are somewhere near the optimal will give results which are 
nearly as accurate as those given by the optimal fractions. Consequently 
there is usually no great difficulty in choosing suitable sampling fractions 
which will reconcile the various conflicting requirements.† 

Since, for these reasons, sampling fractions are often used which are not 
optimal, we have preferred not to adopt the term "optimal allocation," which 
has sometimes been used to denote stratified sampling with a variable sampling 
fraction. 

The within-strata standard deviations can only be estimated from data 
relating to the material to be sampled, or from data derived from similar 
material, but general knowledge of the behaviour of material of a particular 
type, e.g. material stratified into size-groups, will often enable suitable sampling 
fractions to be chosen with all necessary accuracy. It is sometimes suggested 
that a preliminary survey should be undertaken merely to determine the optimal 
sampling fractions, but this is rarely worth while, though if a preliminary survey 
is being undertaken for other purposes it will of course also serve to improve 
the sampling fractions. 

* The situation when domains of study cut across strata is discussed in Sections 
9.3 and 9.4. 

† An exact solution of this problem is given in Section 10.4. 


3.6 Systematic samples from lists 

Although the importance of the principle of random selection in sampling 
has been stressed, much practical sampling is in fact not fully random in 
character. Thus a frequent method of selecting a sample, when a list of the 
units of the population to be sampled is available, is to take every qth entry 
on this list. This may be termed a systematic sample from the list. Other 
more complicated systematic procedures may occasionally be adopted for 
special purposes. 

It is customary, and salutary, to determine the first entry by selecting a 
number at random between 1 and q, but this element of randomness does not 
convert the sample into a random one.‡ A systematic sample would be 
equivalent to a fully random sample if the list were arranged wholly at random. 
No lists, however, are arranged at random. The nearest approach to random 
order is probably provided by alphabetical lists, though even these have certain 
non-random characteristics: in this country, for instance, a large proportion 
of the Scotsmen will be found under the letter M. If every qth entry is taken, 
a kind of partial stratification will therefore be obtained, and the sample 
will be somewhat more precise than a fully random sample. Thus in a 
systematic sample of farms taken from a list of farms arranged by parishes, 
the proportion of farms drawn from each parish will be more or less constant, 
provided the sampling interval is small compared with the number of farms 
in a parish.* 

‡ Except in the trivial sense that the sample is a random sample of 1 unit out of 
q units, each unit being composed of the aggregate of a set of all entries at spacing q. 
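A systematic sample with a random starting entry is simple to express in code. The sketch below is illustrative only; the function name is hypothetical, and the list of 2496 entries with q = 20 is chosen merely as a convenient size.

```python
import random

def systematic_sample(entries, q, seed=None):
    """Take every q-th entry, the first being chosen at random between 1 and q."""
    start = random.Random(seed).randint(1, q)      # position counted from 1
    return entries[start - 1::q]

entries = list(range(1, 2497))                     # positions in a list of 2496 entries
sample = systematic_sample(entries, q=20, seed=3)
```

There are only q possible samples, one for each starting entry, which is the trivial sense in which such a sample is random.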

Owing to the lack of definition of the strata it is impossible to make a fully 
valid estimate of the sampling error, but provided there are no periodic 
features in the list the sample will not be biased. An estimate of the sampling 
error which is good enough for practical purposes can usually be made by 
regarding the sample as a sample stratified in the major subdivisions of the list, 
ignoring any minor and ill-defined groupings. If the sampling error is estimated 
as if the sample were fully random an overestimate will be obtained, the 
inaccuracy being greater the more marked the similarity of the neighbouring 
entries in the list. 

In general, systematic sampling from lists will be found to be quite 
satisfactory provided care is taken to see that there are no periodic features 
in the list which are associated with the sampling interval. The method is 
often much more convenient than random or stratified random sampling, 
since the labour of making a proper random selection, which in an extensive 
sampling scheme is often very considerable, is avoided. It must be clearly 
recognized, however, that the responsibility for the judgment that the material 
is such that systematic sampling will give satisfactory results rests with the 
investigator. 

Sampling in which the selection is wholly systematic should be clearly 
distinguished from sampling in which there is proper random selection of 
sampling units which are themselves systematic aggregates of smaller sub- 
divisions of the material. Thus a common method of sampling rows of potatoes 
has been to use sampling units consisting of every 20th plant, two such sampling 
units being selected at random from each row by selecting two numbers at 
random between 1 and 20. Such a method of sampling fulfils all the conditions 
required for fully valid random sampling. 


3.7 An example of alternative ways of sampling highly variable 
material 


n the Preceding Sections, we will consider their application to 
the problem of determining the area under w. 


this purpose the wheat acreages of Hertfordshi 


* Methods of taking a systematic stratified sample from a list or card index are 
described in Section 10.2. 
of crop acreages in a country such as England, since all farmers make returns 
of their acreages each year ; the only possible use of sampling would be at 
the compilation stage, where it might be used to avoid the necessity of totalling 
the whole of the returns. The present investigation is not fully relevant to 
this problem, however, since the acreage of only one of the most extensively 
grown crops is considered, and that for only one county. Considerably greater 
errors, proportionately, may be expected in the less common crops. 
Records were available for 2496 farms, and the acreage of wheat, and also 
the total acreage of crops and grass, which is virtually the total acreage of 
farmed land and will be termed the size of the farm, were abstracted for each 
farm. The original records were arranged by districts, by parishes alphabetically 
within districts, and by farmer’s name alphabetically within parishes. This 
order was preserved in the abstract. The return for any farm, or “ holding,” 
does not necessarily relate solely to land in a given parish, but may include 
land in other parishes farmed by the same farmer; farmers with two or more 
distinct farms may make separate returns for these farms or may include them 
all in a single return. The total area of wheat in the county, from the abstracted 
returns, was 44,676 acres, and the total area of crops and grass was 273,074 
acres.* 

If farms are taken as sampling units the dominant source of variation in 
wheat acreage will be variation in size of farm, since farms range from 1 acre to 
over 1000 acres, and no farm can have more than a fraction of its area under 
wheat. Stratification by size of farm is therefore indicated. The use of a 
variable sampling fraction will also be advantageous, since the wheat acreages 
of the large farms will be much more variable than those of the smaller farms. 
Further stratification by districts is possible, but is not likely to give much 
increase in precision unless the incidence of wheat growing in the different 
districts is very markedly different. In any case a systematic method of 
selection from the list, which in view of the alphabetical method of arrangement 
will be quite satisfactory, will give the effect of stratification by districts. 
For comparative purposes the following samples were taken : 


(1) a random sample of 1 in 20 farms, 125 farms in all; 
(2) a stratified random sample with a uniform sampling fraction of 1 in 20; 
(3) a stratified systematic sample with a variable sampling fraction, the 
fraction being approximately proportional to mean size of farm within 
each size-group, and chosen so as to give about the same number of 
farms, actually 135, as samples 1 and 2; the systematic method of 
selection within size-groups results in approximate stratification by 
districts also. 


* It may be noted that these values disagree with the values shown in Agricultural 
Statistics, viz. 46,281 acres and 278,380 acres respectively. The reasons for this 
discrepancy need not concern us here, but it provides an illustration of the fact 
that disagreement between sample and complete returns must not be assumed to 
be necessarily solely due to sampling error. 
sampling fractions and numbers of farms for sample 3 are shown in Table 3.7.a. 
For sample 2 the last two size-groups were combined. 


TABLE 3.7.a. HERTFORDSHIRE FARMS, 1939: SIZE-GROUPS, 
NUMBERS OF FARMS, AND CHOSEN VALUES OF THE VARIABLE 
SAMPLING FRACTION 


Size-group (acres crops and grass)        Sampling fraction 

e for each farm in the sample, and the 


and in the case of the acreage the actual errors of the estimates, a 
in Table 3.7.b. The sampling standard errors, as will 


acreage that, as is to be expected, both stratification and the use of a variable 
as the stratified sample with a fixed sampling fraction. This is also to be 
expected. 

The numbers of units, i.e. farms, required to attain the same accuracy with 
different methods of sampling and estimation may be taken as roughly 
proportional to the squares of the standard errors, allowance being made 
for the greater number of units included in sample 3. These are shown in 
the column "relative variance," the random sample with direct estimation 
being set at 100. Thus stratification by size, or elimination of variation due to 
size in the estimation process, reduces the number of farms required by a 
factor of about 4, and the variable sampling fraction results in a further 
reduction by a factor of about 2½. 


TABLE 3.7.b. HERTFORDSHIRE FARMS: COMPARISON OF VARIOUS TYPES OF 
SAMPLES AND METHODS OF ESTIMATION 


                                     Wheat acreage                                No. of farms growing wheat 
No.  Method of selection             Estimate  Standard  Actual    Relative      Estimate  Standard  Relative 
     and estimation                            error     error     variance                error     variance 
                                                                   per farm                          per farm 
1a   Random, direct                  46,020    ± 7,950   + 1,340   100           900       ± 104.6   100 
1b   Random, stratified after        41,100    ± 4,320   - 3,580   30            860       ± 7...    52 
     selection 
1c   Random, by ratio                41,570    ± 3,940   - 3,110   ...           Not calculated 
1d   Random, by regression           40,400    ± 4,130   - 4,280   27            Not calculated 
2    Stratified, direct              40,220    ± 4,110   - 4,460   27            1,080     ± 71.6    47 
3    Variable sampling fraction,     ...       ...       ...       10            911       ± 88.9    72 
     direct 
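The "relative variance" figures follow from the standard errors by squaring their ratio to that of the random sample with direct estimation. The sketch below is a hedged illustration: the function name is hypothetical, and the optional allowance for a different number of units is an assumption about how such an allowance might be made, not a statement of the method used for sample 3.

```python
def relative_variance(se, se_reference, n=None, n_reference=None):
    """100 * (se / se_reference)^2, optionally scaled by n / n_reference
    to allow for a different number of units in the sample."""
    ratio = (se / se_reference) ** 2
    if n is not None and n_reference is not None:
        ratio *= n / n_reference
    return 100.0 * ratio

# Wheat acreage, reference standard error 7,950 (random sample, direct estimation).
for se in (4320, 4130, 4110):
    print(se, round(relative_variance(se, 7950)))
# Number of farms growing wheat, reference standard error 104.6.
print(round(relative_variance(71.6, 104.6)))
```

The results, 30, 27, 27 and 47, agree with the corresponding entries of Table 3.7.b.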

The situation with respect to number of farms growing wheat is somewhat 
different. Stratification has again resulted in considerable increase in accuracy, 
though the gain is not so great as with acreage. The sample with a variable 
sampling fraction, on the other hand, is not so accurate as the ordinary stratified 
sample. The sampling fractions which are optimal for the determination of 
wheat acreage are by no means optimal for the determination of the number 
of farms growing wheat. 

The actual errors of the estimates bear little relation to the sampling 
standard errors, except that they are in no case markedly larger than these 
standard errors. The random sample without adjustment gives the most 


give an accurate estimate. The accuracy of a sampling procedure must never 
be judged by the magnitude of a single discrepancy; a large discrepancy 
provides some evidence that a method is inaccurate, but a single small 


3.8 Multi-stage sampling 


In multi-stage sampling the material is regarded as made up of a number 
of first-stage sampling units, each of which is made up of a number of 
second-stage units, etc. The sampling process is carried out in stages. At the first 
stage the first-stage units are sampled by some suitable method; 


By suitable choice of sampling fractions it is often possible to keep the 
over-all sampling fraction (i.e. the product of the sampling fractions at the different 
stages) constant for different parts of the population. This leads to considerable 


in Section 2.9, it permits the concentration of the field work of censuses and 
surveys covering large areas. On the other hand 
a multi-stage sample is in general less accurate than is a sample containing 
the same number of final-stage units which has been selected by some 
suitable single-stage process. 
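A two-stage selection with a constant over-all sampling fraction can be sketched as follows. The parishes and farms named here are invented for the illustration, and simple random selection is assumed at both stages.

```python
import random

def two_stage_sample(first_stage_units, f1, f2, seed=None):
    """first_stage_units: dict name -> list of second-stage units.

    A fraction f1 of the first-stage units is taken at random, then a
    fraction f2 of the second-stage units within each selected first-stage
    unit, so that the over-all sampling fraction is f1 * f2 throughout.
    """
    rng = random.Random(seed)
    names = sorted(first_stage_units)
    chosen = rng.sample(names, round(f1 * len(names)))
    return {
        name: rng.sample(first_stage_units[name],
                         round(f2 * len(first_stage_units[name])))
        for name in chosen
    }

# Hypothetical material: 20 parishes of 50 farms each.
parishes = {f"parish {i}": [f"farm {i}.{j}" for j in range(50)] for i in range(20)}
sample = two_stage_sample(parishes, f1=0.25, f2=0.2, seed=1)
n_farms = sum(len(farms) for farms in sample.values())
print(len(sample), n_farms, n_farms / 1000)
```

Five parishes and fifty farms are selected, an over-all fraction of 1 in 20, and only the five selected parishes need a list of their farms.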

Multi-stage sampling also has the important advantage that subdivision 
into second-stage units, i.e. the construction of the second-stage frame, need 
only be carried out for those first-stage units which are actually included in 


Since there are many variants of multi-stage sampling which are possible 
for any given type of material, careful investigation is often required before a 
decision as to the procedure which is best for any particular purpose can be 
reached. This matter will be discussed in detail in Chapter 8, after the methods 
of evaluating sampling errors have been described. 


3.9 Sampling with probabilities proportional to size of unit 


If we have areas demarcated on a map, such as fields, and a point is located 
at random on the map, the probabilities of the point falling within the boundaries 
of the different fields are clearly proportional to the areas of the fields. 
Consequently areas can be selected at random with probabilities proportional 
to their size by the simple procedure of taking random points on the map. 
It will be noted that such a process of selection may result in the same area 
being included twice or more in the sample. In this case it must be counted 
twice or more. We cannot, without distorting the probabilities, make a further 
selection in the manner followed with equal probabilities. 

The principle has applications in agricultural surveys designed to determine 
the acreage and yield of different agricultural crops, total cultivated area, etc. 
All that is required for acreage is to determine the proportion of points which 
fall in areas of the given type. The method is therefore particularly attractive 
when carrying out surveys of the areas of crops, etc., by aerial survey, provided 
the different crops can be recognized on the photographs, since it avoids all 
the measurements of area which would be required if an ordinary random 
sample of areas were taken. The sampling of the fields with probabilities 
proportional to size is in this case equivalent to the sampling of small unit 
areas of equal size whose locations are determined by the random points. 
When only areas require to be determined the sizes of the fields in which the 
random points fall are in fact immaterial. 

The analogy with the case of a stratified sample with a variable sampling 
fraction indicates that under certain circumstances greater precision may be 
expected from areas selected with probabilities proportional to size than will 
be obtained if they are selected with equal probabilities. 

In the case of yield determinations, when the total acreage is known, the 
determinations of the yield from a sample of fields selected with probability 
proportional to size may always be expected to give a more accurate estimate 
of the mean yield per acre and total yield than will similar yield determinations 
on a random sample of fields irrespective of size. If the total acreage is not 
known then the situation is more complicated, but here again sampling with 
probabilities proportional to size is often advantageous. 


Sampling with probabilities proportional to size of unit, or to some other 
known quantitative character of the units, may be carried out on other types of 
material by forming a cumulative or running total of the sizes of the units, and 
selecting numbers at random from the total of all the units in the manner of 
Example 3.2.c.* Stratification by size and the use of a variable sampling 
fraction will usually be preferable in such cases, however, on the grounds both 
of accuracy and convenience, except in the special circumstances to be described 
in the next section. 
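
The cumulative-total procedure described above can be sketched as follows. This is a minimal illustration only: the function names and the unit sizes are hypothetical, the sizes are assumed to be whole numbers, and selection is assumed to be with replacement (the details of Example 3.2.c are not reproduced here).

```python
import bisect
import itertools
import random


def unit_for_number(cumulative, r):
    """Index of the unit whose cumulative range contains the number r (r >= 1)."""
    # First position whose running total is >= r; units of size 0 are never hit.
    return bisect.bisect_left(cumulative, r)


def select_pps(sizes, n, rng):
    """Draw n units, with replacement, with probability proportional to size."""
    cumulative = list(itertools.accumulate(sizes))  # running total of the sizes
    total = cumulative[-1]
    # Random numbers between 1 and the total of all the units, inclusive.
    draws = [rng.randint(1, total) for _ in range(n)]
    return [unit_for_number(cumulative, r) for r in draws]


# Hypothetical sizes (e.g. acreages) of five units.
sizes = [10, 20, 70, 0, 100]
print(select_pps(sizes, 3, random.Random(1)))
```

A unit of size 70 covers 70 of the 200 possible numbers and is therefore selected with probability 70/200 at each draw, while the unit of zero size can never be selected.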


* An alternative method is described in Section 10.8.

3.10 Sampling from within strata with probabilities proportional to size of unit


Apart from area sampling, sampling with probabilities proportional to size
of unit is mainly of use when the units are stratified according to some other
characteristic, and the number of units to be selected from each stratum, or
from some of them, is small. In this case an ordinary stratified sample will
give either inaccurate or biased estimates when the ratio method of estimation,
explained in Chapter 6, is used. The bias or inaccuracy is removed by selecting
the units from within strata with probabilities proportional to size. This
fact appears to have been first recognized by Hansen and Hurwitz (1943, A), [...]

[...] to investigate the effect of taking parishes as sampling units, or as
first-stage units in two-stage sampling with farms as second-stage units [...]
stage units in two-stage sampling. In analogous [...] countries, where
definition of farm boundaries may present difficulties, a complete survey of
small administrative [...] any attempt to sample individual farms.

Inspection of the Hertfordshire data showed that [...]
parishes being grouped roughly in the order in which they appeared in the
alphabetical list, to form “combined” parishes containing over the minimum
acreage. The effect of this combination is shown in Table 3.11.a.


TABLE 3.11.a—NUMBERS OF FARMS, PARISHES, AND “COMBINED” PARISHES
IN THE DISTRICTS OF HERTFORDSHIRE

District                           No. of    No. of      No. of parishes
No.        District                farms     parishes    after combination
1          Barnet                  …         17          7
2          Bishop's Stortford      …         23          16
3          East Herts.             …         31          20
4          Hitchin                 …         36          25
5          St. Albans              …         …           …
6          Tring                   …         …           …
7          Watford                 …         …           …
           Total                   …         140         91

Districts were used as strata in this sampling, 1 in 5 “combined” parishes
being taken per district, i.e. 17 parishes in all.


Two samples were taken. In sample A the parishes were selected in the
ordinary manner, with equal probability of selection for each parish. In
sample B selection with probabilities proportional to size was employed. The
parishes of sample B were also sub-sampled in two ways, samples B1 and B2.
In sample B1 a uniform sampling fraction of [...] was taken for sampling at the
second stage, with stratification by size, using the size-groups of Table 3.7.a
with the last two size-groups combined. In sample B2 a variable sampling
fraction was used with values [...] for size-group 1-50 acres, [...] for 51-150
acres, [...] for 151-300 acres, and 1 for over 300 acres. Sample B is given in
detail in [...]. The efficiency of the various methods is discussed in Section 8.9.

The results are summarized in Table 3.11.b. This table is similar to
Table 3.7.b, except that estimated average values of the standard errors are
given, and [...] those calculated from the actual selected samples. These latter
are not sufficiently accurate for comparison owing to the small number of [...]

It will be seen that a sample of 1 in 5 parishes provides results which are
decidedly more accurate than a stratified random sample of 1 in 20 farms
with a uniform sampling fraction, but somewhat less accurate than a similar
sample with a variable sampling fraction. The stratified random sample of
1 in 20 farms is 1.29 times as accurate as sample B1, allowing for differing
numbers of farms. The similar sample with the variable sampling fraction is
1.83 times as accurate as sample B2. Sample B is somewhat more accurate
[...] or sample A. The difference is not marked, however, since the combination
of parishes has created units which do not differ excessively in size.


TABLE 3.11.b—HERTFORDSHIRE FARMS: SAMPLES FOR WHEAT ACREAGE WITH
“COMBINED” PARISHES AS SAMPLING UNITS OR FIRST-STAGE UNITS IN
TWO-STAGE SAMPLING

         No. of   Method of sampling                               Method of                   Expected                   Relative
Sample   stages   1st stage                    2nd stage           estimation        Estimate  standard error   Actual    variance
A        1        Stratified by district       -                   Overall ratio     41,730    ± 3,080          ± 2,950   100
A        1        Stratified by district       -                   District ratios   41,010    ± 3,010          ± 3,670    95
B        1        Stratified by district,      -                   District ratios   46,660    ± 2,870          ± 1,980    87
                  probability proportional
                  to size
B1       2        As for B                     Stratified random   District ratios   48,930    ± 4,950          ± 4,250   259
B2       2        As for B                     Variable sampling   District ratios   45,600    ± 3,460          ±   920   127
                                               fraction

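
The relative variances of Table 3.11.b appear to be obtainable from the expected standard errors. The following sketch assumes (this is an interpretation, not stated in the text) that the relative variance is 100 times the squared ratio of each expected standard error to that of sample A with the overall ratio; the names used are hypothetical.

```python
# Expected standard errors as printed in Table 3.11.b.
expected_se = {
    "A, overall ratio": 3080,
    "A, district ratios": 3010,
    "B": 2870,
    "B1": 4950,
    "B2": 3460,
}


def relative_variance(se, base_se):
    """Variance relative to the base method, taking the base as 100 (assumed definition)."""
    return 100.0 * (se / base_se) ** 2


base = expected_se["A, overall ratio"]
for name, se in expected_se.items():
    print(f"{name:20s} {relative_variance(se, base):6.1f}")
```

On this assumption the computed figures agree with the tabulated values (100, 95, 87, 259, 127) to within one unit, which is about what would be expected if the printed standard errors have themselves been rounded.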
3.12 Multi-phase Sampling 


[...] Further phases may be added if required, [...] application [...]
information on the whole population is available. Thus, in a crop estimation
survey based on farms as sampling units, a relatively large sample of farms 
may be taken for the determination of the acreage of the crop, and the yields 
may be determined on a sub-sample only of these farms. 

If the first-phase information is collected prior to the second-phase
information the first-phase information may be used as a basis for the
sub-sampling process, e.g. by stratification of the first-phase units for the
selection of the second-phase sample, with or without the use of a variable
sampling fraction at the second phase.
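
The selection process just described may be sketched as follows. The sketch covers selection only, not estimation; the population, the acreages, the stratum boundary and the sampling fractions are all hypothetical, as are the function and variable names.

```python
import random


def two_phase_sample(population, first_phase_n, stratum_of, fractions, rng):
    """Select a first-phase sample, then sub-sample it by strata.

    stratum_of(unit) must depend only on first-phase information;
    fractions[label] is the second-phase sampling fraction for that stratum.
    """
    first_phase = rng.sample(population, first_phase_n)
    strata = {}
    for unit in first_phase:
        strata.setdefault(stratum_of(unit), []).append(unit)
    second_phase = {}
    for label, units in strata.items():
        # At least one unit per stratum, never more than the stratum holds.
        k = min(len(units), max(1, round(fractions[label] * len(units))))
        second_phase[label] = rng.sample(units, k)
    return first_phase, second_phase


# Hypothetical population of 1,000 farms; crop acreage is recorded at the first phase.
farms = list(range(1000))
acreage = {farm: (farm * 37) % 200 + 1 for farm in farms}
fractions = {"small": 0.1, "large": 0.5}   # variable sampling fraction


def size_group(farm):
    return "large" if acreage[farm] > 100 else "small"


first, second = two_phase_sample(farms, 200, size_group, fractions, random.Random(3))
print({label: len(units) for label, units in second.items()})
```

With a uniform fraction in place of the two values shown, the same function gives second-phase stratification without a variable sampling fraction.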

It will be noted that in both these latter applications of two-phase sampling 
the methods followed are the same as those adopted in ordinary single-phase 
sampling, the population being replaced by the first-phase sample ; but since 
the first-phase information is not known for the whole population it is itself 
subject to sampling error, and this must be taken into account when estimating 
the sampling errors of the estimates of the second-phase variates.

Multi-phase sampling differs structurally from multi-stage sampling in 
that in the former the same sampling units are used throughout, whereas in 
the latter a hierarchy of sampling units is used. Multi-phase sampling may be 
combined with multi-stage sampling. In a scheme for the estimation of the 
acreages and yields of agricultural crops, for example, a two-stage sample of 
farms and parishes may be taken for the estimation of acreages, and a
sub-sample of these farms may be taken for the estimation of yields.


3.13 Balanced samples 

If the average value of some quantitative character of the units, such as 
size, is known for the whole population, it is possible, provided the sizes of 
the individual sampling units are known, to select a sample in such a manner 
that the average size of the selected units is equal to the average size of all 
the units of the population. Such a sample will only be satisfactory if it is 
otherwise equivalent to a random sample, in which case it may be termed
a balanced sample.

Balance may be employed in conjunction with stratification for some other
character. In this case balance may be effected either for the whole population,
or for each of the strata separately. The latter course should only be adopted
if the number of units selected from each stratum is moderately large: otherwise
undue restrictions will be placed on the sample which will result in the selection
of a sample which is not otherwise equivalent to a random sample. On the
other hand, when the strata are balanced separately more accurate estimates
of the separate strata means and totals will be obtained, and the accuracy
of the estimates of the overall population means and totals may also be
somewhat increased.

Balancing for a quantitative character provides an alternative to
stratification by size-groups in this character. Balancing, however, will only
be effective if the differences in the quantity or quantities under investigation
are approximately proportional to differences in the known character, whereas
stratification by size-groups will take account of any type of relationship. As
will be seen in Chapter 7, the estimation of sampling errors is simpler in the
case of a stratified sample.

The increased accuracy resulting from balancing can [...] variates increases,
[...] which is inherently [...]

[...] random sample of the required size is [...] The average value of [...]
calculated for the sample. This will, [...] the average for the population,
indicating lack of [...], and compared with the first [...] Substitution of the [...]

[...] the average values of other characters in the sample and population was poor,
and that of the frequency-distributions of such characters was even worse. 
The real weakness here is the use of excessively large units, though even with 
smaller units the use of purposive selection without rigorous rules of selection 
is always liable to give unsatisfactory results. There is, moreover, no means 
of judging its reliability. 

For these reasons purposive selection has ceased to be extensively used, 
and in modern sampling work it has largely been replaced by more thorough 
application of the principles of stratification, etc. Provided proper attention 
is paid to the process of selection, however, there is no fundamental objection 
to balanced samples. These have a certain limited usefulness in some types 
of census and survey work, though it must be recognized that the need for the 
subdivision of the population into an adequate number of sampling units is
in no way obviated by balancing for one or more quantitative characters.


3.14 Systematic samples from areas 

A common method of sampling material continuously distributed either 
in space or time is to take sampling units distributed at equal intervals over 
the material. The chief application in census and survey work is in the 
sampling of land areas. When maps are available the sampling units can be 
located by superimposing a grid of points, frequently of square, or nearly 
square, pattern. Such a sample may be termed a systematic area sample. 
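
The location of a square grid of sampling points may be sketched as follows. The sketch assumes a rectangular area and a grid origin chosen at random within one grid interval; the dimensions, spacing and function name are hypothetical.

```python
import random


def systematic_grid(width, height, spacing, rng):
    """Square grid of points over a width x height rectangle, with random origin."""
    x0 = rng.uniform(0, spacing)   # random offset of the grid in each direction
    y0 = rng.uniform(0, spacing)
    points = []
    j = 0
    while y0 + j * spacing < height:
        i = 0
        while x0 + i * spacing < width:
            points.append((x0 + i * spacing, y0 + j * spacing))
            i += 1
        j += 1
    return points


# Hypothetical map area of 100 by 60 units, with grid points 10 units apart.
grid = systematic_grid(100, 60, 10, random.Random(7))
print(len(grid), grid[0])
```

Only the origin is random: once it is fixed every other point is determined, which is why the sampling error cannot be estimated in the same way as for a random sample.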

A systematic area sample differs from a systematic sample from lists mainly 
in the spatial distribution of the sampling units over the material. Most lists 
do not correspond at all exactly, except for major groupings, to any physical 
distribution, and a systematic sample from a list therefore usually approximates 
much more closely to a random sample than does a systematic area sample. 
Different methods of estimating the sampling error are therefore appropriate
in the two cases.

In general, provided there are no periodic features, a systematic area sample
will be rather more accurate than a stratified random sample (with one unit
per stratum) from strata consisting of rectangular blocks (or cells) whose centres
are situated at the systematic sampling points. In material in which the
variation is of a continuous nature it is impossible to make any accurate estimate
of the sampling error without taking supplementary sampling points, though
if there are no periodic features an upper limit can be obtained.

If the regions near the boundary are likely to differ from the remainder 
of the area, as may be the case if the boundary is a natural one, such as a sea 
coast or a mountain range, it will be best, after locating the sampling grid at 
random, to demarcate the bounding lines of the cells, and sample at random 
the area which is not covered by complete cells, dividing it into equal or 
approximately equal areas and locating one sampling point at random within 
each of these areas. It will be convenient, if possible, to make these cell areas
equal in area to those of the sampling grid, since equal weight will then be
given to all sampling points. The same method of dealing with boundaries
can be used if the sampling is random within rectangular cells, as [...]

Systematic sampling is entirely unsuited to material which has periodic
features, but apart from this will generally provide a satisfactory method of
area sampling. It has the advantage over stratified random sampling from
blocks that the location of the sampling units is simpler and the results obtained [...]

[...] points or areas. In such cases sets of parallel lines [...]
as the sampling units. In stratified random line sampling, [...]
into rectangular blocks of convenient length and of such a width that two [...]
blocks. In systematic line sampling the sample is made up [...] The method [...]
the ground in undeveloped country [...] photographs. The [...],
the sampling becomes two-stage. If the lines
and the points on them are both evenly spaced the sampling is equivalent to
systematic point sampling.

[...] are measured. A car fitted with [...] of the yield near
harvest time can also be obtained in a similar [...] cutting and harvesting
a small area of [...] systematic manner, such as entering [...]

[...] roads are by no means randomly located with regard to agricultural crops.
The results from surveys following the same route in successive years may well 
be comparable, however, and with calibration by more exact methods from 
time to time, road cruising may provide a satisfactory method of making rapid 
and inexpensive surveys. Similar methods based on tracks are possible in
areas with only a sparse road network.


3.16 The principle of the moving observer 


If counts are required of a collection of individuals who are moving about, 
the ordinary methods of sampling can only be applied with difficulty. Thus, 
to determine the number of people in a crowded street by ordinary methods 
would require the demarcation of a number of small areas in the street and 
the counting of the number of people on each of these areas. The counts 
need not necessarily be simultaneous, but for any one area the number of 
people present at a given moment has to be counted. Unless photographic 
methods are available, or the areas are very small, such counts are extremely 
difficult, since individuals are continually moving into and out of the areas 
and are also moving about within them.

Equally it is no use stationing observers at fixed points with instructions 
to count passers-by. The number of people in a street will depend not only 
on the numbers passing fixed points but also on the velocity of movement up 
and down the street. If all exits and entrances to the street are covered, and 
there are no people in the street at the start of the counts, the number present 
at any subsequent time can be determined from counts that are continuous 
and without error. In practice, however, errors in counting usually result in 
cumulative errors which invalidate the results. Thus it was found impracticable 
to determine the numbers in a department store by posting people at the doors
to make counts.

These difficulties can be overcome by using moving instead of stationary
observers. To obtain an estimate of the number of people in a street, the
observer traverses the street in one direction, counting all the people he passes,
in whichever direction they are moving, and deducting all the people who
overtake him. He then re-traverses the street in the opposite direction, moving
at the same speed and counting as before. If this is done the average of the
two counts gives an estimate of the average number of people in the street
during the time of the counts. If people are mostly moving in one direction
the count in this direction will be reduced, but the count in the opposite direction
will be correspondingly increased. In practice the deductions required for
those overtaking the observer can be kept small by moving at a speed greater
than that of the majority of the crowd.
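
The arithmetic of the moving-observer estimate can be written out as a short sketch. The counts used are hypothetical, and the function names are introduced here for illustration only.

```python
def traverse_count(passed, overtaken_by):
    """Net count for one traverse: people passed, less those who overtake the observer."""
    return passed - overtaken_by


def moving_observer_estimate(first_traverse, second_traverse):
    """Average of the net counts for the two traverses, made in opposite directions."""
    return (traverse_count(*first_traverse) + traverse_count(*second_traverse)) / 2


# Hypothetical counts, given as (people passed, people overtaking the observer).
outward = (46, 4)    # moving with the main stream: fewer passed, some overtake
inward = (61, 1)     # moving against the main stream: more passed
print(moving_observer_estimate(outward, inward))
```

Here the net counts are 42 and 60, so that the estimated average number of people in the street is 51.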

This method was used to estimate the numbers of people in streets, shops,
etc., at different times of the day, in order that the adequacy of the provisions
for public air raid shelters might be tested. It was found that very dense crowds
in streets and shops could be estimated with surprising ease. Crowded streets
were dealt with by teams of two or more observers [...]
in a transverse line, with each observer counting the people between him and
the next observer. In the large stores the floor was divided [...]
were assigned to the different observers.

The method is of general application. It can be used, for example, to assess
populations of insects or animals in a state of movement, provided all individuals
can be readily seen, and provided the passage of the observer does not itself
influence the movement of the individuals.


3.17 Interpenetrating samples 


It is often advantageous to take two or more independent samples of a
given population, using the same sampling procedure for each sample. Such
samples are called interpenetrating samples.

Interpenetrating samples are of value if the survey or census has to be
carried out by successive stages. This is frequently necessary when preliminary
results are required quickly. Thus in the 1942 Census of Woodlands of [...],
it was necessary to obtain a [...] estimate of the total timber content of the
whole [...] required, [...]

[...] that separate and [...] are furnished.
The agreement of such estimates [...] to the layman than
any statement of the sampling error.

[...] groups, with [...] likely to give
a satisfactory pair of interpenetrating samples. The separate samples would
be subject to variation between counties, and would therefore be considerably 
less accurate than the combination of the two, from which variation between 
counties is entirely eliminated. The proper use of interpenetrating samples 
therefore necessitates increased expenditure on travelling. 


3.18 Sampling on successive occasions 


The types of sampling so far discussed are appropriate to a census or 
survey carried out on a single occasion, with the object of determining the 
characteristics of the surveyed population at or about a given point in time. 
If the population is subject to change, a survey carried out on a single occasion, 
however accurate, cannot of itself give any information on the nature or rate 
of such change. In certain types of population extraneous sources of 
information, such as registrations of births and deaths, may be relied on to 
provide information on the changes which the population is undergoing. Even 
in such cases the census must be repeated at intervals, both because of
inaccuracies in the extraneous information, which may lead to a gradual
accumulation of errors, and also because the information is rarely of such a
nature that all aspects of the original census or survey can be kept up-to-date.
Registration of births and deaths, for example, coupled with figures for
immigration and emigration, will furnish data for the revision of the total of
the population but will not enable changes in the population of separate towns
and districts to be determined.

In many cases no such extraneous information on the changes that are taking
place is available, and in such cases provision must be made for periodical
re-survey if up-to-date information is required. A number of alternatives
then present themselves:

(1) A complete census or survey may be repeated in its original form at
intervals.

(2) A sample census or survey may be repeated at intervals, a new sample
being selected on each occasion without regard to previous samples.

(3) A sample census or survey may be repeated on the same sample.

(4) Part of the sample may be replaced on each occasion, the remainder
being retained. If there are a number of occasions a definite scheme
of replacement may be followed, e.g. one-third of the sample may be
replaced, each selected unit being retained (except for the first two
occasions) for three occasions.

(5) A re-survey of a sub-sample of the original sample may be made. In
the case of a complete census this is equivalent to a re-survey of a sample
of the whole population.

* See also Section 10.14.

The following terms are suggested for the last four alternatives:
(2) independent samples; (3) fixed sample; (4) partial replacement;
(5) sub-sample. It will be noted that independent samples are formally
equivalent to interpenetrating samples, a fixed sample is formally equivalent
to the observation of different characters (variates) on the same sample, and a
sub-sample is formally equivalent to a two-phase sample. Only partial [...]
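
The scheme of replacement given as an example under alternative (4) can be sketched as follows. The sketch assumes that the sample is divided into three groups of equal size, labelled by a running number, one group being replaced on each occasion; the labelling and the function name are hypothetical.

```python
from collections import Counter


def rotation_schedule(n_occasions, groups_in_sample=3):
    """Labels of the groups surveyed on each occasion, one group being replaced each time."""
    return [list(range(t, t + groups_in_sample)) for t in range(1, n_occasions + 1)]


schedule = rotation_schedule(5)
for occasion, groups in enumerate(schedule, start=1):
    print(occasion, groups)

# Number of occasions on which each group is surveyed (within the five occasions shown).
occasions_served = Counter(g for groups in schedule for g in groups)
```

Groups 1 and 2, which form part of the initial sample, are retained for one and two occasions respectively; groups 3, 4 and 5 are each retained for three occasions, and two-thirds of the sample is common to any two successive occasions.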


[...] these units as well as on the relative importance of information on the
population means and on the changes in these means. If, for instance, the units
are very variable but the changes of all units are similar, accurate information
on change can most easily be obtained by re-survey of a fixed sample of units;
provided always that proper provision is made for new entrants to the population,
and for the elimination of the disturbance which results from the extinction
of selected units. If, on the other hand, information on the population means
is of paramount importance, partial replacement or a sub-sample will usually
be preferable. A more detailed discussion, in terms of the errors to which the [...]

[...] with sampling on successive occasions. Firstly, repeated re-survey of the
same units may be inexpedient, since resistance to the provision of the necessary
information may be engendered, and secondly, repeated [...]
in modification of these units relative to the rest of the population. This can
arise in many ways. In a survey of agricultural practice, for instance, visits [...]

[...] sampling based on lists of houses may be best in the towns. There is, of [...]

3.20 Combination of complete census and sample survey


It sometimes happens that a complete enumeration of the population can 
easily be made, but that detailed information on the individual units of the 
population can only be obtained by sampling methods. In such cases a complete 
enumeration will often be of value as a frame for the sample census. Thus 
in a census of a human population, a complete enumeration consisting of lists
of households and of numbers in each household can be made. A sample 
of houses can then be visited by investigators so as to obtain details regarding 
the age, sex, etc., of the occupants. Such a sample census will not only serve 
to provide the required detailed information, but will also provide a partial 
check on the accuracy of the complete enumeration. It will not, however, 
provide a check on omissions from the lists of households. To carry out such 
a check it will be necessary to take a further sample of properly defined areas, 
checking that all the households in the sample areas have been included in 
the full census returns.

A complete census, even if it is very inaccurate, is also of the greatest
use in planning a more accurate sample census. In a sample census of a
human population, for instance, some knowledge of the relative sizes of different
towns and villages, and of the density of population in rural areas, is essential
for the proper allocation of resources. Similarly in a census of agriculture,
knowledge of the amount of cultivated land in different parts of the country
is necessary if excessive survey of largely uncultivated areas is to be avoided.

The information provided by an inaccurate complete census can also be
used to improve the accuracy of a subsequent sample census, by the methods
applicable to supplementary information which will be given in Chapter 6.
Here, however, we must proceed with caution. If, for instance, a complete
census of a human population consistently underestimates the population of
villages of all sizes by about 10 per cent., the sample census will determine
the amount of the underestimation and a common adjustment can be made.
If, however, small villages are underestimated by 20 per cent. and large villages
[...], the application of a common correction will result in the under-estimation
of the population of small villages and the overestimation of the [...]
villages. This distortion will be avoided if separate [...] for small and large
villages; [...] it is not always possible to be certain that all [...].
These differential inaccuracies are particularly troublesome [...] they tend
to be associated with the administrative areas for which separate results are
required. The overall results, however, will not be materially
affected by [...] inaccuracies of this kind if the methods of estimation
given in Chapter 6 are followed.
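
The effect of a common correction, as against separate corrections for small and large villages, can be illustrated numerically. All the figures below are hypothetical (in particular the 5 per cent. underestimation assumed for large villages is chosen for illustration and is not taken from the text), and the sample census is assumed to give the true populations of the villages concerned.

```python
def adjustment_factor(true_counts, census_counts):
    """Ratio by which the census figures must be multiplied to agree with the true total."""
    return sum(true_counts) / sum(census_counts)


# Hypothetical populations: small villages 20 per cent. under, large villages 5 per cent. under.
true_pop = {"small": 1000.0, "large": 2000.0}
census_pop = {"small": 800.0, "large": 1900.0}

# One common adjustment for all villages.
common = adjustment_factor(true_pop.values(), census_pop.values())
common_adjusted = {k: census_pop[k] * common for k in census_pop}

# Separate adjustments for small and large villages.
separate_adjusted = {
    k: census_pop[k] * adjustment_factor([true_pop[k]], [census_pop[k]])
    for k in census_pop
}

print(common_adjusted)
print(separate_adjusted)
```

With the common factor the total is correct, but the small villages remain underestimated and the large villages are overestimated; with separate factors both classes are reproduced correctly.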
CHAPTER 4


PRACTICAL PROBLEMS ARISING IN THE PLANNING 
OF A SURVEY 


[...] with [...] as are required [...], and agriculture, [...]
by the term quality control, presents rather different problems which are not
discussed in this book. The sampling problems encountered [...]
research are also omitted from the discussion.

The questions that require consideration at the planning stage of censuses
and surveys may be broadly classified as follows:

(1) Specification of the purposes of the survey.

[...]

[...] stage, etc., determination of size of sample required, and [...]
selection.

(7) Decision on whether the survey is to be an isolated [...]
without intention of repetition, or is to be planned [...]
repetition at intervals.

These questions cannot be considered [...]
a greater or less extent any decision taken on one question will influence the
decisions that should be taken on the others. [...]
jointly, or if independent decisions are made these should at least be regarded
as tentative and subject to modification until the [...]

[...] Knowledge is also
required of the ways in which it is practicable to collect the required information
with the necessary accuracy. [...] prior knowledge in these matters [...]



[...] adequate knowledge of the statistical properties of the material, pilot surveys
are frequently advisable in large-scale surveys in order to test and improve
field procedure and schedules, and to train field workers.

Questions arising under heads (1), (2), (3), (4) and (7) of the above list
are common both to complete censuses and surveys and to sample censuses
and surveys. Even here, however, the problems encountered differ considerably
in the two cases, owing to the greater scope for the collection of detailed
information and the execution of complicated observations by the sampling
method.

The determination of the items on which information is to be collected,
the degree of detail to be attempted, and the ways in which the information
can best be obtained, often constitute the most difficult and crucial part of the
planning of a survey. No amount of care in the planning of the sampling
or skill in the analysis will compensate for failure in this respect. A survey
in which the information collected does not adequately cover the field to be
investigated at the best provides a partial and incomplete picture, and at
the worst may be irrelevant or actively misleading.

Careful consideration must therefore be given at the outset to the purposes
for which the survey is to be undertaken, the type of information it is proposed
to collect, and the uses to which the information obtained will be put. In
the case of large-scale surveys, which are likely to provide information that
will be of value to a number of different organizations or government
departments, a detailed statement on these points should be prepared. In this way
those who are likely to want to make use of the results of the survey will be fully
apprised of its nature, and can if necessary make suggestions for modifications
before the survey is begun.

The statistician who will ultimately be responsible for the analysis and
presentation of the results should, if possible, be selected and appointed at
the planning stage. Similarly if the advice of a statistical expert is to be sought,
this should be done, in the first instance, at the [...]. This [...]
applies even in the simplest types of census. It [...] happens that
censuses are undertaken without any prior consultation with a statistician,
whose advice is only sought when the results have been collected and the
stage of analysis is reached.


4.2 Definition of the population

[...] which require to be included in a [...] of material, which [...]
categories, or [...] scope, are conditioned in broad outline by the
[...] administrative and research requirements [...]
broad outline, however, there is often [...] consideration should therefore be
[...] categories, particularly those on [...] likely to be specially difficult,
or for which [...] important marginal categories [...]
the task of collecting the information may often be very materially simplified,
without seriously reducing the value of the results.

A census of the human population residing in a given territory, for example, 
should ideally include all individuals present in that territory at a particular 
moment, and in simple censuses an attempt is usually made to attain this 
end. It is often, however, difficult to obtain information on certain minor 
categories, such as nomads. These difficulties occur even in a complete census, 
but are often more marked in the case of a sample census. The question of 
whether such categories may be omitted entirely without serious loss should 
therefore be considered. 

The matter becomes of even greater importance when a human population 
census requiring the collection of detailed and complicated information is 
undertaken, using skilled investigators making visits to individual members 
of the population. In such cases visits to members of the population with a 
permanent residence, even if they are absent from their residence at certain 
times, are relatively simple, but it is far more difficult to cover the floating 
elements of the population. The conduct of such a census becomes very much 
simpler, therefore, if these latter elements can be omitted. 

In a similar manner, in the case of an agricultural census, the determination
of the areas of the various crops might ideally require that all areas of the crops 
grown within the boundaries of the territory should be included. It may, 
however, be possible to exclude small areas, such as those found in gardens 
and holdings of very small size, without seriously reducing the value of the 
information. The agricultural censuses of England and Wales, for example, 
which are based on returns from farmers, exclude all agricultural holdings of 
less than one acre, and do not attempt to take account of crops grown in 
private gardens or allotments. 

The question of whether or not minor categories should be included
depends mainly on the purposes for which the information is required. A
case is sometimes made for the inclusion of certain categories on which the
information is intrinsically of little interest in order to ensure comparability
with the results of previous censuses or surveys, or with the results of parallel
surveys in other countries. Comparability within and between statistical
series is obviously desirable, and lack of it can seriously reduce the value of
the results, and also increase the labour of statistical analyses and the danger
that those unfamiliar with the details of the various sources of information
may draw wrong conclusions. Nevertheless when introducing a radically new
method of collecting information, such as replacement of a complete census
by the sampling method, excessive weight should not be given to past practice.
It should not be forgotten that so-called complete censuses are often in
themselves subject to errors of various kinds, including lack of completeness,
and that such errors are often a greater source of disturbance to comparability
than the omission or inclusion of a few minor categories. If there is any serious
doubt whether a given category should or should not be included this may be
regarded as prima facie evidence that the category in question differs in
essentials from the other more important categories. Consequently, if it is
decided that the category should be included, the results should be kept separate 
so that they can be summarized separately and eliminated from the final 
estimates if required, or given special treatment in these estimates. If this 
is done for the first one or two surveys of the new type, comparability with 
previous results will be ensured, without preventing the omission of the 
category in subsequent surveys if this ultimately appears desirable. 

The arguments in favour of the adoption of identical definitions in different 
countries in which conditions are radically different are even less strong. 
Categories which are of very minor importance in one country may be of 
great importance in another. Decisions as to their inclusion or omission should 
be taken primarily on the grounds of their importance in the country which is 
being surveyed, without undue regard to definitions designed to ensure formal 
uniformity of world statistics. 

In many cases in which complete omission of unimportant categories 
would not be justified, they can be very conveniently dealt with by some 
special sampling procedure, which may be multi-phase, or may be of an 
entirely different type with different frame and sampling units. Thus in a 
human census, certain of the simpler items of information, which can be 
reliably furnished by neighbours or other members of the household, may be 
collected for absentees abroad, or a sub-sample of these absentees may be taken 
for a follow-up enquiry by more intensive methods. Nomads may be dealt 
with by instituting a supplementary sample census to deal only with this 
category of the population. 

In a sample survey the frame adopted contains its own implicit definitions 
of the categories of material to be covered. If a category is not included in 
the frame it will either have to be omitted entirely or special steps will have to 
be taken to supplement the frame. Definitions of the population should therefore 
be considered in conjunction with the choice of frame. 


4.3 Determination of the details of the information to be collected 


The detailed problems which arise in deciding what information is necessary 
and how it can best be obtained vary widely in surveys covering different 
fields of enquiry and according to whether the results are required primarily 
for administrative or for research purposes. Full discussion of any particular 
case necessarily requires extensive knowledge of the subject as a whole and 
of the particular questions at issue, and would be out of place here, but there 
are certain general points which may be mentioned. 

The basic problem is essentially that of the selection of the most relevant 
items of information or types of observation from all those which it is practicable 
to collect and which might conceivably have a bearing on the matters under 
investigation. This selection must be such that a coherent whole is obtained 
which covers the required field adequately, or if this is not possible at least 
provides information on some relevant part of it. 

This basic problem is essentially the same in complete censuses and sample 
censuses and surveys, but the problem is more complex in the case of sample 
censuses and surveys, since the items of information that can be collected 
and the observations that can be made are themselves more complex and 
varied. 

The best way of arriving at a satisfactory solution of this basic problem 
is usually as follows. In the first instance, the details of the information 
required to deal with the problems originally propounded are determined. 
The question is then considered whether there are any related problems of 
importance on which this information, possibly supplemented to some 
extent, would throw light. If this is the case the supplementary items of 
information required for the full elucidation of these additional problems should 
be determined. With the whole field mapped out in this way, the practicability 
of obtaining the necessary items of information covering any given set of 
problems can be considered, and final decisions taken in the light of the 
relative importance of these problems and the total load which it is considered 
expedient to place on the investigators and respondents in a single survey. 

The details of this process vary greatly in different types of survey, but 
the general principle to follow in all types of survey is to see that the items 
of information collected form a rounded whole covering a definite subject or 
coherent group of subjects. 

This principle is of particular importance in surveys of the questionnaire 
type on human populations, whether the questionnaires are filled in by the 
respondents themselves, or the information is elicited by field investigators. 
Accurate information can only be obtained in such surveys if full and willing 
co-operation of those providing the information is obtained. The survey must 
therefore have a clear purpose which can be explained to the respondents, 
and the questions asked must be relevant to this purpose. If additional 
questions dealing with unrelated subjects are included, or if the questions 
relating to the main enquiry seem trivial, and do not cover aspects which 
appear of importance to those providing the information, the survey will cease 
to appear as a serious enquiry into a particular subject, and will meet with 
unfavourable reactions, summed up in such terms as “ snooping.” 

The matter is of importance even in enquiries which require the collection 
of factual information by observation and measurement by the investigators 
themselves, without any co-operation from respondents. If the field 
investigators are not imbued with a sense of the importance of their enquiry, 
and are overloaded with the collection of miscellaneous data, they will not 
give of their best. Occasionally information may be sought on points 
unconnected with the main survey if it is urgently needed, and considerable 
expense is thereby saved, e.g. in travelling, but this should be avoided as far 
as possible. 

Occasionally, in cases in which a questionnaire would otherwise be unduly 
long it may be possible to split it into parts, obtaining information on one 
group of items from one set of respondents, and on another group of items 
from a second set. Certain basic items of information will be required from 
all respondents, and the two sets will form a pair of interpenetrating samples. 
The sampling has also a two-phase structure, the basic information acting as 
first-phase information for both sets. If this procedure is followed, however, 
the relationships between items of information in the two groups can only 
be studied for strata or other suitable domains of study, and not for the 
individual respondents. 
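
The restriction just described can be illustrated by a minimal sketch in Python. The function names, stratum labels and figures are hypothetical and are not taken from the text; the sketch assumes only that each respondent's stratum is known from the basic items. Because no respondent supplies both groups of items, the two groups can be brought together only as stratum averages.

```python
from collections import defaultdict


def stratum_means(records):
    """Average the values within each stratum.

    `records` is a list of (stratum, value) pairs from one set of respondents.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for stratum, value in records:
        sums[stratum] += value
        counts[stratum] += 1
    return {stratum: sums[stratum] / counts[stratum] for stratum in sums}


def paired_stratum_means(set_one, set_two):
    """Pair the stratum means of an item asked of set one with those of
    a different item asked of set two, for strata present in both sets."""
    means_one = stratum_means(set_one)
    means_two = stratum_means(set_two)
    common = sorted(means_one.keys() & means_two.keys())
    return {s: (means_one[s], means_two[s]) for s in common}


# Hypothetical data: item A from the first set, item B from the second set.
item_a = [("north", 2.0), ("north", 4.0), ("south", 10.0)]
item_b = [("north", 1.0), ("south", 3.0), ("south", 5.0)]
pairs = paired_stratum_means(item_a, item_b)
```

Each entry of `pairs` relates the two items for a stratum as a whole; nothing in the data permits the same comparison for an individual respondent.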

Certain items of information are often required in order to ensure the 
proper interpretation of other items. Thus, for example, if housewives are 
being asked whether they prefer coal, gas or electric cooking, and the reasons 
for their preferences, it is essential to ascertain in some detail what experience 
they have had of methods of cooking other than the one they are now using, 
including the type of apparatus used. If this is not done the answers may be 
more an indication of the effectiveness of an advertising campaign in favour 
of one of the methods, or a condemnation of antiquated pieces of apparatus, 
rather than any reflection of the true relative merits of the different methods. 

Information is also often required on items which, though not of primary 
interest, will act as supplementary information and thereby enable the precision 
of the results to be increased by the appropriate methods of estimation. 

In reaching the decisions on the type of information required, both in broad 
outline and in detail, it is absolutely essential to work in collaboration with 
experts in the subjects which it is proposed to cover. If [...] 
administrative experience in the subjects to be covered is lacking, it is fatally 
easy when designing a survey to omit some vital items of information. A simple 
example of such an omission is provided by the 1921 and 1931 Population Censuses 
of the United Kingdom. In these censuses information on age of mother at 
[...] of children born, which had been obtained in the 
[...], was not asked, with the result that the value of the information 
provided by these censuses for studies on changes of fertility of the population 
was reduced. As a result of this lack of information it was 
felt necessary to institute a special Family [...]. 
In this instance it can scarcely be that the need for this information was 
overlooked, but insufficient weight must clearly have been given [...] 
of census information. 

In addition to direct collaboration with experts in the various subjects, 
the [...] should be circulated at all stages of development to 
the [...] and individuals who are likely to be interested in 
the [...]. This will usually result in requests for the collection of 
[...] information, some of which [...] 
the purpose for which the survey was originally planned. [...] 
the results to be used for other purposes, the usefulness of the 
survey may often be considerably increased. On the other hand the danger of 
overloading the survey with the collection of miscellaneous information 
must be guarded against, and all requests should therefore be very carefully 
reviewed. 


4.4 Inter-relations of groups of natural units 


If the physical inter-relations between the members of groups of natural 
units of the material under survey are of interest, or if information is required 
for groups of natural units as a whole, then information must be collected 
for such groups as a whole, or at least for pairs of units from such groups. 
Thus if the inter-relations between the different members of a household 
require to be studied, it is essential to have information for pairs of individuals 
belonging to the same household, and it is usually best that the information 
should cover the whole of a household. This can be ensured by using 
households or dwellings as sampling units. 

Another type of natural aggregate for which it is often important to obtain 
results as a whole is that provided by towns, villages, etc., and, in agricultural 
surveys, homogeneous geographical areas. This often calls for the adoption 
of multi-stage sampling, the natural aggregates forming the sampling units 
at the first stage, even in cases where the use of single-stage sampling is 
otherwise preferable. Thus in a survey of a human population it may be 
of considerable interest to contrast the results for individual towns of differing 
types, and to study the inter-relations existing within a single town, even 
when there is no need for all the towns of the country to be covered. 

Similarly, if inter-relations between the behaviour of the same individuals 
or other natural units at different times are of interest, the survey must be 
designed so as to provide information covering an adequate period of time. 
Thus in an investigation into hours of sleep of children, it is of little value to 
determine the amount of sleep of a sample of children each for a single day. 
Such data will throw no light on the question of whether children who have 
a short period of sleep on a particular day tend also to go short of sleep on 
other days or are able to make up for this short period by longer periods on 
preceding or following days. In the same way, studies of nutrition in which 
the intake of food is determined for each individual for a single day only, 
although they will show whether a group as a whole is under-nourished, are 
incapable of revealing the degree of variation in under-nourishment between 
individual and individual, since individuals going short of food on a particular 


day may make up for such deficiencies, in whole or in part, on succeeding 
days. 

[...] respondents, either consciously or unconsciously, to telescope events, and 
report them as happening in the given period when in fact they happened 
earlier. Thus a survey of crockery breakages made by asking what breakages 
occurred over the past week led to an entirely excessive estimate of the amount 
of breakage, whereas a similar survey asking for breakages over the past year 
gave results which checked well with production figures and the onat 
stocks (Box and Thomas, 1944, D, discussion). 


4.5 Practicability of obtaining the required information 


So far we have been considering the problem of determining what items 
of information are required in order that the purposes of the survey may be 
fulfilled. Each item must, however, be considered in the light of the 
practicability of obtaining it. If the information is to be furnished in response 
to questions, the points for consideration are whether the respondents are 
sufficiently informed to be capable of giving accurate answers; whether, if 
the provision of accurate answers involves them in a good deal of work, such 
as consulting previous records, they will be prepared to undertake this work; 
whether they have motives for concealing the truth, and if so whether they 
will merely refuse to answer, or will give incorrect replies. If the information 
is to be obtained by observation or physical measurement, the points for 
consideration are whether the observations are such that they are within the 
competence of the investigators or other individuals who will be required to 
undertake them; whether they will make excessive demands on the time of 
the investigators or others, or require excessively expensive apparatus; and 
whether the owners of the surveyed material will permit the observations to 
be made. 

Considerations of this kind will inevitably lead to modifications of what 
would otherwise be considered an ideal scheme. Nor can general answers be 
given, even within the limits of a particular field of enquiry. In countries 
such as the United Kingdom, for example, there is no reason to suppose that 
any large amount of inaccuracy is introduced into the returns of the population 
censuses by deliberate mis-statements. In countries not accustomed to 
population censuses fear that the information will be used for such purposes 
as taxation or conscription may lead to considerable inaccuracies. Similarly 
in crop-sampling work the use of small sample areas may be quite satisfactory 
with certain classes of field worker, but, as is shown by Table 2.5, is entirely 
unsatisfactory in other cases. 

When the ideal requirements cannot be fully met it is sometimes possible 
to include other items of information, observations, or physical measurements, 
which, owing to their high correlation with the quantities which it is desired 
to determine, provide more or less adequate substitutes for these quantities. 
These substitute measures may be used for purposes of stratification or 
classification in the final analysis, as for example when the rateable 
value of a house is taken as a substitute for the income of the household 
occupying it; or they may be substitutes for measures of quantities which 
themselves require assessment, as for example the use of eye estimates in place 
of direct measurements of the yields of a standing crop. The efficiency of such 
substitute measures can only be properly judged by a proper statistical 
investigation of the relations between them and the quantities for which they 
are substitutes. In the case of substitutes for measures of quantities that are 
to be assessed, some method of calibration is essential if objective estimates 
of the original quantities are to be obtained. (The calibration of eye estimates 
is discussed in Sections 6.15 and 7.14.) 
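
One simple form that such a calibration might take is sketched below in Python. It is an illustration only and is not the procedure of Sections 6.15 and 7.14, which are not reproduced here: it assumes that, on a small calibration sub-sample, both the eye estimate and the direct measurement are available, and fits an ordinary least-squares line of measurement on eye estimate. The function names and figures are hypothetical.

```python
def fit_calibration(eye_estimates, measurements):
    """Fit measurement = intercept + slope * eye_estimate by least squares.

    Both lists refer to the same calibration units, in the same order.
    """
    n = len(eye_estimates)
    if n < 2 or n != len(measurements):
        raise ValueError("need at least two paired observations")
    mean_x = sum(eye_estimates) / n
    mean_y = sum(measurements) / n
    sxx = sum((x - mean_x) ** 2 for x in eye_estimates)
    if sxx == 0:
        raise ValueError("eye estimates must not all be equal")
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(eye_estimates, measurements))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def calibrate(eye_estimates, intercept, slope):
    """Convert eye estimates into calibrated values using the fitted line."""
    return [intercept + slope * x for x in eye_estimates]


# Hypothetical calibration plots: the observer under-estimates by 2 units.
intercept, slope = fit_calibration([10.0, 20.0, 30.0], [12.0, 22.0, 32.0])
adjusted = calibrate([40.0], intercept, slope)
```

The fitted line is then applied to the eye estimates made on the remaining units. How well it serves depends on the closeness of the relation in the calibration sub-sample, which is itself a matter for statistical investigation.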

It will inevitably happen in certain cases that information which is of 
considerable importance will prove to be unobtainable, or unobtainable with 
sufficient accuracy. When such a situation arises it must be squarely faced. 
There is at times a tendency to attempt to collect information which, because 
of its nature, cannot be obtained with the necessary accuracy, and then 
to condemn the survey method in general because the results are of little 
value. 

This, however, does not mean that the collection of difficult items of 
information should not be attempted. The sample survey procedure, because 
it makes possible the use of skilled investigators working on a relatively small 
sample, is frequently capable of eliciting reliable information on points which 
it would be quite impossible to include in a general enquiry. The fact that 
the enquiry is on a small sample, if known to the respondents, frequently 
makes them willing to give information which they would certainly not be 
prepared to give if the enquiry were general. In such cases it is important 
that the investigators should themselves be recognized as impartial and 
disinterested ; in particular they should not be officials of an organization 
which itself might make use of the information obtained to the detriment 
of the respondents. 

Nevertheless there are subjects on which it is impossible to collect accurate 
information from a random sample of the population. In certain of these 
cases information can be collected from a selected group of individuals, 
e.g. individuals with whom social welfare workers are in contact. Information 
of this type is not necessarily valueless, but it must be clearly recognized that 
it is not the equivalent of information obtained from a random sample of the 
whole population, and any attempted generalization of the results will be of 
limited validity. 

Attempts are sometimes made to obtain a sample from such a group of 
individuals which conforms more closely in certain respects to the population, 
e.g. in classification by age or social class, than does the group as a whole. 
While this may improve the sample somewhat, it still does not provide the 
equivalent of a random sample. On the other hand, if the whole of the group 
is not required, it is usually advisable to apply some rigorous form of selection 
rather than to permit the workers themselves to select individuals for 
investigation, as the latter procedure will merely introduce further unnecessary 
elements of bias. 


In cases in which some of the items of information are difficult to collect, 
multi-phase sampling may be of value. It may, for instance, enable specially 
skilled investigators to be used for the more difficult items. Thus in a health 
survey medically qualified investigators may be used on a small sub-sample 
of a much larger sample on which more general items of information relating 
to health have been collected. Equally it may be used to reduce the work 
required to manageable proportions. Thus, in the Survey of Fertilizer Practice 
soil samples for chemical analysis were taken from one old-arable field, one 
new-arable field, and one field of permanent grass on each farm, these fields 
being a sub-sample of all the fields on which information on the use of 
fertilizers was obtained (Section 4.23). 


4.6 Methods of collecting the information 


The methods of collecting the information are to a large extent conditioned 
by the material under survey and the type of information required. Where 
the alternative possibilities exist, it may be stated as a general rule that 
observations are preferable to questions, and questions on facts and on past 
actions are preferable to questions on generalities and on hypothetical future 
conduct. Thus it is better to inspect a house to see if it shows signs of damp, 
than to ask the occupant if the house is damp ; and it is better to find out 
what considerations, from among the various alternatives (if any) that presented 
themselves, governed the selection of the house in which the occupant is living, 
rather than to ask what type of dwelling (house, flat, bungalow, etc.) is 
“preferred.” 

On the other hand, it is scarcely possible to state any general rule with 
regard to physical measurements and qualitative observations made by the 
investigator. Physical measurements are more objective, but qualitative 
observations are often more capable of summing up the salient features of a 
complex situation. Thus a qualitative grading by the investigator of the degree 
of dampness of a house is likely to be more effective than any physical 
measurements designed to determine the [...]. Moreover, by proper 
standardization and calibration among investigators qualitative observations 
can themselves be made objective. 

When the information is collected by means of a census form or questionnaire 
the questions which are to be asked should be considered at the planning 
stage, since the information obtained will depend on the exact form of these 
questions. Equally the exact form of any observations and physical 
measurements which are required should be determined. 

Census forms and questionnaires may be designed either for completion 
by the respondent with little or no assistance from investigators, or for 
completion by the investigator by the aid of questions put to the respondents. 
In the latter case the investigators may be instructed to 
ask questions with a given wording, or they may be instructed to elicit 
information which will provide answers to the questions of the questionnaire 
by enquiry and discussion without adherence to any exact form of words. 
Both means of eliciting information may be required in the same survey for 
different items of information. 

Census forms and questionnaires designed for completion by the respondent 
may be delivered and returned by post, delivered by post and collected by 
an enumerator or investigator, or vice versa, or delivered and collected by an 
investigator. Use of the post is clearly most economical, and is the method 
generally followed in censuses and surveys of industrial and commercial 
organizations, such as censuses of production. In such cases the use of investi- 
gators will not normally have any great advantage over the post, either in ensur- 
ing more complete response or obtaining more accurate information, though 
occasionally in local surveys investigators may be used to explain the purposes 
of the survey and persuade the respondents to co-operate. In population 
censuses, however, investigators are normally used both in order to ensure 
the maximum response, and to give assistance where necessary in filling up 
the forms. Censuses and surveys of small-scale industrial and commercial 
organizations, and of farms, occupy an intermediate position, and the method 
used will depend to a large extent on local circumstances. 

Attention must be paid to the detailed wording of all questions, even if 
these are only intended as guides to the investigator. If the question itself 
creates a wrong impression in the mind of the investigator this will undoubtedly 
lead to errors, even if additional explanatory notes indicate that something 
else is really required. 

Careful thought must also be given to the order of the questions. If questions 
are arranged in an orderly sequence the investigator’s task is much easier, and 
the respondent’s reaction is likely to be more favourable. This applies to all 
forms of questionnaire, but is most important in the verbal questionnaire. 

In many types of survey it is profitable to give the investigator or respondent 
an opportunity of making general remarks on special points. This can be done 
very simply by including a space for observations. Some guidance should 
be given on the type of observations required. Although such observations 
do not easily lend themselves to exact analysis they are frequently of considerable 
value in drawing attention to relevant facts not covered by the questionnaire itself. 

The type of investigator to be employed must also be considered. 
Investigators should have a background knowledge of the subject under 
investigation, particularly in investigations of the research type. In a technical 
investigation into housing conditions, for instance, the investigators should 
have some knowledge of housing construction and of standards normally 
adopted in good practice. This requirement of technical knowledge in the 
investigators limits the scope of unspecialized teams of investigators. Such 
[...] carrying out ad hoc and routine investigations which 
[...] simple questionnaires, but they are no substitute for the [...] 

In surveys requiring any high degree of technical knowledge it is usually 
best either to use members of existing organizations, or to appoint a small 
specialized team of technically qualified research investigators. The various 
surveys into the technical and economic aspects of agricultural practice in 
England and Wales, for example, are carried out by the staffs of the National 
Agricultural Advisory Service and the Provincial Advisory Economists. By 
this means teams of investigators are obtained who are technically qualified 
and capable of discussing the problems involved with the farmers; at the 
same time the investigators themselves gain a wider knowledge of the farms 
of their district which is of value to them in their other work. 


4.7 Methods of dealing with non-response 


Unless non-response is confined to a small proportion of the whole sample 
the results cannot claim any general validity. Every effort must therefore 
be made to reduce non-response to negligible proportions. 

Non-response is usually most serious in postal questionnaires. Delays 
in response can also sometimes be very troublesome, particularly when the 
results are required quickly. A rigorous system of dealing with failure to respond 
and delay in response must therefore be instituted at the outset. The first 
step is to send a follow-up letter, but if this does not produce the required 
effect, the possibility of using more intensive methods such as telephone calls 
and personal visits must be considered. These will require a special regional 
organization. 

In censuses of industrial and commercial undertakings in which data on 
production, sales, labour force, etc. are required for the purposes of economic 
planning it is usually possible to make the returns compulsory. This is often 
a help in dealing with a small minority of recalcitrant institutions, particularly 
if pressure can be brought to bear in other ways, but it is no substitute for full 
and willing co-operation by the majority. Complete population censuses 
are usually also made compulsory, and there appears to be no logical reason 
why sample censuses of the same type should not also be compulsory. While 
this is little help in dealing with obstinate refusals, since the census authorities 
are not likely to wish to bring the offenders before the courts, it is an indication 
that the government regard the census as of importance, and to this extent 
is likely to act as a persuasive force with the waverers. 

In censuses which are to be repeated at intervals it is particularly important 
to deal vigorously with non-response and delay in response at the outset, as 
otherwise they tend to increase progressively. If any large volume of 
non-response persists, or if there is any serious delay in making the returns, it is 
an indication that something is wrong with the census, which should either 
be reorganized or abandoned. 

In sociological surveys using [...] 
non-response is usually small. If [...] 
investigator used should be reviewed. [...] 
can be tried, but are not likely to be very effective. In technical surveys of 
agriculture involving interviews with the farmers the amount of deliberate 
non-response is also usually small, unless the amount of information required 
is such that it puts too heavy a burden on the respondents. 

In sociological surveys, however, initial non-response due to failure to 
contact the respondent can be very troublesome. There is no proper way 
of dealing with this except by persistent call-backs. The number of call-backs 
can often be reduced by enquiring of neighbours when the respondent is likely 
to be at home, or where he can be found so that an interview can be arranged. 
Call-backs are also required because the respondent, though willing to give 
the information, is otherwise engaged at the time of the first call. 

The amount of work involved in follow-ups and call-backs can be reduced, 
if this appears desirable, by taking a sub-sample of those not contacted at the 
first (or subsequent) call, and weighting up the sub-sample in the final results. 
In repeated censuses, however, complete follow-ups are advantageous in 
encouraging better response to later censuses. 
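
The weighting-up of a sub-sample of those not contacted can be sketched as follows in Python. The sketch is a minimal illustration under stated assumptions: the sub-sample is taken at random from those not contacted, every unit in it is eventually interviewed, and the function name and figures are hypothetical rather than drawn from the text.

```python
def weighted_up_mean(first_call_values, callback_values, n_not_contacted):
    """Estimate the sample mean when only a sub-sample of the units not
    contacted at the first call is followed up.

    first_call_values: values obtained at the first call
    callback_values:   values obtained from the followed-up sub-sample
    n_not_contacted:   number of units not contacted at the first call
    """
    if n_not_contacted < len(callback_values):
        raise ValueError("sub-sample cannot exceed the number not contacted")
    if n_not_contacted > 0 and not callback_values:
        raise ValueError("no call-backs to represent the units not contacted")
    # Each followed-up unit stands for this many units not contacted.
    weight = n_not_contacted / len(callback_values) if callback_values else 0.0
    total = sum(first_call_values) + weight * sum(callback_values)
    return total / (len(first_call_values) + n_not_contacted)


# Hypothetical example: 3 interviews at the first call, 4 units not
# contacted, of which a sub-sample of 2 is followed up (weight 2).
estimate = weighted_up_mean([2.0, 4.0, 6.0], [10.0, 20.0], 4)
```

Each followed-up unit receives a weight equal to the reciprocal of the sub-sampling fraction, so that the sub-sample represents the whole of the group not contacted. The saving in field work is obtained at the cost of some loss of precision for that group.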


4.8 The frame 


The whole structure of a sampling survey is to a considerable extent 
determined by the frame. The methods of survey which are suitable for a 
given type of material may be radically different in different territories because 
different types of frame have to be used. Consequently, until particulars 
of the nature and accuracy of the available frames have been obtained, no 
detailed planning of the survey can be undertaken. If no frame exists, the 
construction of a frame suitable for the purposes of the survey may well constitute 
a major part of the work of the survey. 


Frames are subject to various types of defect, which may be broadly 
classified as follows. A frame may be: 


(1) Inaccurate. 

(2) Incomplete. 

(3) Subject to duplication. 
(4) Inadequate. 

(5) Out of date. 


A frame may be termed inaccurate if information about the units listed 
in it or defined by it is inaccurate. The term may also be used to cover the 
listing of units which do not in fact exist. Thus a ration-card list in which 
certain women were incorrectly described as married when they were in fact 
single, or in which certain individuals were included who had died, would be 
inaccurate in these respects. 

A frame may be said to be incomplete when certain units of the material are 
omitted entirely, and be subject to duplication when certain units of the material 
are included more than once. Thus a ration-card list in which certain individuals 
were not included, and others were included twice, would be both incomplete 
and subject to duplication. 
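
The two defects can be illustrated by a minimal sketch in Python. It assumes, purely for illustration, that an independent and correct listing of the units is available for comparison, which in practice will seldom be the case; the function name and the unit labels are hypothetical.

```python
from collections import Counter


def frame_defects(frame, reference_units):
    """Compare a frame with an independent listing of the units.

    Returns (omitted, duplicated): units of the reference listing absent
    from the frame, and units entered in the frame more than once.
    """
    counts = Counter(frame)
    duplicated = sorted(unit for unit, c in counts.items() if c > 1)
    omitted = sorted(set(reference_units) - set(counts))
    return omitted, duplicated


# Hypothetical list: individual C is missing and individual B appears twice.
omitted, duplicated = frame_defects(["A", "B", "B", "D"], ["A", "B", "C", "D"])
```

The duplicated entries are found from the frame alone, whereas the omitted units can be found only by reference to the independent listing.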

A frame may be termed inadequate when it does not cover all the categories 
of the material which it is desired to include in the survey. Thus a ration-card 
list which did not include the temporary residents in a district would be 
inadequate for a survey of the population of that district in which it was 
necessary to include such temporary residents. 

A frame, though accurate, complete, and free from duplication at the time 
it was constructed, may no longer be so at the time it is required for use. Such 
frames may be said to be out of date. Errors of all the first three of the above 
types may be introduced through the frame being out of date. 

These different types of defect have very different consequences in the 
defects they introduce into the sampling process. Inaccuracy in the frame, 
in so far as it relates to the selected sampling units, will automatically be 
discovered and corrected as the survey progresses, and consequently will not 
invalidate the results. If the information contained in the frame has been used 
as a basis of stratification, etc. or as supplementary information, inaccuracy 
in this information will result in somewhat lower accuracy in the results, but 
the actual accuracy attained will be assessable from the results themselves. 

Incompleteness in the frame will not be discovered in the course of the 
survey itself, and to the extent to which a frame is incomplete the population 
or material will fail to be covered. Incompleteness is likely to be more serious 
than it appears to be at first sight, since it is often confined to units possessing 
some special characteristics, which may in consequence be seriously 
under-represented in the sample. Duplication has a similar effect, since the 
duplicated units will have a double chance of being included in the sample. There 
is the difference, however, that incompleteness cannot be determined or set 
right by an examination of the frame itself, whereas duplication may under 
certain circumstances be detected and corrected by such examination, though 
this will almost always be a tedious operation. If the sampling fraction is 
large and the degree of duplication is also large, the duplication may come 
to light in the course of the survey. Thus, with 5 per cent. duplication and a 
sampling fraction of 1 in 10, two out of every 210 units in the sample will on 
the average constitute a duplicate pair. With a sampling fraction of 1 in 100, 
however, only two out of every 2100 units in the sample will constitute a 
duplicate pair. 

A frame which is inaccurate for certain purposes may be incomplete for 
others. Thus a ration-card list in which some of the single women were 
described as married would be complete, though inaccurate, for a survey of 
all women, but would be incomplete if used as a frame for the 
survey of single women only. Such incompleteness could be remedied by 
taking a sample covering all women, and rejecting those members of the sample 
who were found on investigation to be married. 
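
The duplicate-pair figures quoted in this section (two in 210, two in 2100) can be checked with a little arithmetic. The sketch below is illustrative only: the function name `duplicate_pair_share` is hypothetical, and it assumes that each duplicated unit is listed exactly twice and that every list entry is selected independently with probability equal to the sampling fraction.

```python
from fractions import Fraction


def duplicate_pair_share(duplication_rate, sampling_fraction):
    """Expected share of sample entries that form part of a duplicate pair.

    Assumptions (not from the text): a proportion `duplication_rate` of the
    true units is listed exactly twice, and each list entry is selected
    independently with probability `sampling_fraction`.
    """
    # Frame entries per true unit: one each, plus one extra per duplicated unit.
    entries_per_unit = 1 + duplication_rate
    # Expected number of sample entries per true unit.
    sampled_entries = entries_per_unit * sampling_fraction
    # A pair comes to light only if both copies are drawn; it then
    # contributes two entries to the sample.
    paired_entries = 2 * duplication_rate * sampling_fraction ** 2
    return paired_entries / sampled_entries


if __name__ == "__main__":
    for fraction in (Fraction(1, 10), Fraction(1, 100)):
        share = duplicate_pair_share(Fraction(5, 100), fraction)
        print(f"sampling fraction {fraction}: share of paired entries = {share}")
```

Under these assumptions, 5 per cent. duplication gives a share of 2/210 at a sampling fraction of 1 in 10 and 2/2100 at 1 in 100, which shows how quickly the chance of noticing duplication falls as the sampling fraction is reduced.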

Inadequacy will usually be known before the survey is undertaken 
from the specification of the frame itself. Inadequacy can in general 
only be dealt with by the construction of a subsidiary frame for the omitted 
categories. 

In actual practice, frames are likely to suffer to a greater or less extent 
from all of the above defects. It is therefore essential at the outset of the survey 
to carry out a careful investigation of any frame it is proposed to use, since 
many defects are not at all apparent until a detailed investigation has been 
made. Such an investigation will naturally commence with a study of the 
administrative machinery by which the frame has been constructed and by 
which it is kept up-to-date, but may also have to include a certain amount 
of field work. 


4.9 Frames suitable for censuses and surveys of human populations 


Human populations have a tendency to aggregate in towns and villages, 
often with very high local densities, which makes any form of area sampling 
based on maps and plans subject to high variability, unless a very elaborate 
sampling procedure is adopted. This is most serious if the total numbers 
are not known, and require to be estimated from the sample, but even the 
proportions of the population falling in different categories will be subject 
to substantial errors, since different classes of the population tend to be concen- 
trated in different areas. 


Three very different types of survey of human populations may be 
distinguished. These are: 


(1) Surveys of the census type, requiring the collection of relatively 
simple facts, but covering the whole population, and capable of giving 
separate results for small administrative areas. 

(2) Surveys covering the whole population of a country, and capable of 
giving reasonably accurate estimates for the whole population, and 
possibly for certain broad subdivisions, but not for small administrative 
areas. Such surveys often involve the collection of more detailed 
and elaborate information than do those of type (1). 

(3) Local surveys covering a particular town or rural area, or a few 
contrasted towns or rural areas, in which no attempt is made to obtain 
a sample which is fully representative of the country as a whole. Such 
surveys almost always involve the collection of detailed information 
by field investigators. They are usually investigations of a research 
nature, and may be precursors of simplified surveys on the same 
problems covering the whole country. 


Surveys of the first type present relatively simple sampling problems, 
and relatively complicated administrative problems. The sampling, since 
it has to cover small subdivisions of the population, must generally be 
single-stage, usually with stratification and a uniform sampling fraction. Surveys 
of the third type are also relatively simple; since only limited areas have 
to be covered, a one- or two-stage sampling process usually suffices. 

Surveys of the second type, however, present much more difficult sampling 
problems, and also give much greater scope for increase in efficiency by the use 
of the more elaborate sampling methods. Since results are not required for 
small areas, administrative or other areas can form the first stage of a multi- 
stage process, thus enabling the sampling to be concentrated in relatively 
few areas instead of being spread over the whole country. This condition 
is absolutely necessary when field investigators are used. 

In fully developed areas a good deal of prior information on administrative 
areas is usually available. This often enables the accuracy of sampling at 
the first stage, which in general is the stage which contributes most to sampling 
error, to be substantially improved by the judicious use of stratification, 
supplementary information, etc. The sampling problems of this second type 
of survey are discussed in more detail in Section 4.18. 

Frames suitable for the sampling of human populations may be broadly 
classified as follows : 

(a) Lists of individuals in the population, or parts of it, provided for 
administrative purposes. 

(b) Aggregates of census returns resulting from a complete census. 

(c) Lists of households or dwellings in given areas. 

(d) Town plans. 

(e) Maps of the rural areas. 

(f) Lists of towns, villages, and administrative areas, often with 
supplementary information of various types. 

In the following sections we will give a brief description of the out[...] 
sampling point of view, of these various types of 
[...] the size of the household, which is rarely what is required. If such a method 
of selection is for any reason used, the results must be weighted in inverse 
ratio to the number of people in the sampled households included in the list 
(Section 6.16). 
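
The weighting rule just stated can be sketched as follows. This is a minimal illustration rather than the method of Section 6.16 itself: the function name and the data are hypothetical, and it assumes that a household is reached once for each of its members who appears in the list.

```python
from fractions import Fraction


def household_mean_from_person_sample(selections):
    """Estimate a per-household mean from households reached via a list of persons.

    `selections` is a list of (value, listed_persons) pairs, one per selection.
    Each selection is weighted by 1 / listed_persons, because a household with
    m listed members has m chances of being reached through the list.
    """
    weighted_total = sum(Fraction(value, listed) for value, listed in selections)
    total_weight = sum(Fraction(1, listed) for _, listed in selections)
    return weighted_total / total_weight


if __name__ == "__main__":
    # Hypothetical households with values 10, 20, 30 and 1, 2, 3 listed members.
    # Taking every entry in the list reaches them 1, 2 and 3 times respectively.
    selections = [(10, 1)] + [(20, 2)] * 2 + [(30, 3)] * 3
    unweighted = Fraction(sum(v for v, _ in selections), len(selections))
    print("unweighted mean:", unweighted)
    print("weighted mean:  ", household_mean_from_person_sample(selections))
```

In this example the unweighted mean is pulled towards the larger households, while the weighted mean recovers the true household mean of 20.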

Examples of administrative lists are provided by the National Registration 
and ration-book lists maintained in the United Kingdom. The National 
Register, which was instituted in 1939, has probably always constituted a 
reasonably accurate register of the whole population, but in its early stages, 
at any rate, it was very defective as a local register, owing to the failure of 
individuals to register changes of address. This defect was later rectified 
by establishment of joint offices with the food offices, and insistence that any 
applicant for a new ration book should first have his identity card amended. 
This, however, did not ensure immediate registration of local changes of address, 
since new counterfoils were only required if the removal necessitated change 
of shops. Consequently local changes were often only registered at the time 
of the regular yearly issue of ration books. 

The card index of the ration-book issues necessarily suffered from similar 
defects. Consequently neither of these registers formed a suitable frame for 
the sampling of small administrative areas, such as a single food-office district, 
particularly during the war when movements of population were frequent 
and considerable owing to air raids. On the other hand, they were and are 
capable of serving as a reasonably adequate frame for a sample census of the 
whole population. 

The food-office card index was used as the frame for the 1946 Family 
Census of the United Kingdom. This census was carried out by the Royal 
Commission on Population, with the object of providing, for married women, 
information on age, age at marriage, number and dates of birth of all children 
and husband’s occupation, information which had never previously been 
collected in full. A sample of 1 in 10 of all the married women (including 
those widowed or divorced) was taken, by examining every tenth card and 
recording the name and address if the card was for a female adult with the 
prefix Mrs. or with no prefix. Unmarried women selected by this process 
were requested to mark the questionnaire “unmarried.” Questionnaires 
were dispatched by post, and collected by subsequent visit. 

Since there is necessarily a time lag in cancellation of the old food-office 
card on removal—this is effected by notification from the food [...] 
the new counterfoils—special steps had to be taken to deal with removals. 
This was effected by fixing a “zero” date at an interval prior to the date when 
the sample was taken. The interval was chosen so as to be somewhat longer 
than the time taken for notification of change of address to be received at the 
old office. Thus virtually all duplicate cards corresponding to changes of address 
prior to the zero date would have been removed. [...] 
the zero date were excluded from the sample, and all cancellations bearing 
a date of re-registration subsequent to the zero date were sampled, the [...] 
address being recorded. It will be seen that by this procedure all individuals 
entered into the sampling frame once and once only, and the only individuals 
for whom incorrect addresses were recorded were those who had for some 
reason delayed their re-registration, or for whom notification of change of 
address had not yet been received by the old office. This procedure avoided 
all duplication except in the rare event of excessive delay in notification between 
offices, while obviating the necessity of any attempt to construct a fully up- 
to-date non-duplicated index. 
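
The zero-date rule can be sketched in code, with the caveat that the passage describing it is incomplete in this copy. The sketch below is therefore one possible reading, not the Family Census procedure itself: it assumes that cards opened after the zero date are ignored, and that a cancelled card counts only if its re-registration date falls after the zero date. The `Card` type and its field names are hypothetical.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class Card:
    person: str
    opened: date                           # date the card was opened at this office
    reregistered: Optional[date] = None    # re-registration date shown on a cancelled card


def eligible(card: Card, zero: date) -> bool:
    """Return True if the card belongs to the frame under the assumed zero-date rule."""
    if card.opened > zero:
        return False                 # opened after the zero date: ignored
    if card.reregistered is None:
        return True                  # live card opened on or before the zero date
    return card.reregistered > zero  # cancelled, but the removal came after the zero date


if __name__ == "__main__":
    zero = date(1946, 1, 1)  # hypothetical zero date
    cards = [
        Card("A", date(1945, 3, 1)),                        # never moved
        Card("B", date(1944, 6, 1), date(1945, 11, 1)),     # moved before zero: old card
        Card("B", date(1945, 11, 1)),                       # moved before zero: new card
        Card("C", date(1944, 6, 1), date(1946, 1, 20)),     # moved after zero: old card
        Card("C", date(1946, 1, 20)),                       # moved after zero: new card
    ]
    for card in cards:
        print(card.person, card.opened, eligible(card, zero))
```

Under this reading each person contributes exactly one eligible card, and a person who moved after the zero date is represented by the old card, so that the address recorded is the old one.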


4.11 Frames from complete population censuses 


A complete census, in so far as it really is complete, will automatically 
provide an aggregate of forms which includes all the individuals in the area 
covered by the census. Nevertheless complete censuses, although they would 
appear at first sight to provide very satisfactory frames, have a number of 
defects. A complete census by its very nature can only be carried out at 
infrequent intervals, e.g. every ten years, and consequently the frame provided 
by such a census is for the greater part of its existence badly out of date. The 
way in which the census information is customarily collected and analysed 
also tends to reduce its utility as a frame, since the information is not readily 
accessible, at least in the early stages during which it is being transferred to 
punched cards. One of the great advantages of the food office register used 
in the Family Census was that the cards could be consulted and sampled 
in the local offices without serious disturbance of the office routine. 

Many of these disadvantages can be overcome if at the time a complete 
census is undertaken arrangements are made to construct a proper master 
sample from which further samples can be drawn as required. The sampling 
unit for such a master sample should be the dwelling, and not the individual 
or household occupying that dwelling at the time of the census. If dwellings 
are adopted as sampling units the master sample will have a much greater 
degree of permanence than would be the case if individuals were used as sampling 
units. Furthermore, for most purposes a sample of households, and not of 
individuals scattered over all households, is required. 

A complete census will provide a very suitable frame for a simultaneous 
sampling census in which more detailed information is collected on a sample 
of the population. This procedure was used in the 1940 census of population 
in the U.S.A. (Stephan et al., 1940, C). In this census supplementary questions 
were asked of 1 in 20 of the individuals included in the complete census, at the 
same time as the main census information was collected. The procedure was 
thus analogous to two-phase sampling, with the exception that the first phase 
[...]. Since the selection of the 1 in 20 individuals for the collection of the 
supplementary information was done on the spot by the field investigators, certain very 
[...] rules had to be instituted in order to avoid bias. The actual procedure 
was as follows. The census forms each contained lines for 80 individuals, 
40 on each side. Two of the lines on each side were specially marked. 
Five different types of form were used, with the marked lines distributed in 
the manner shown in Table 4.11. 


TABLE 4.11—SAMPLING LINE NUMBERS IN THE 1940 U.S. POPULATION CENSUS, 
AND THEIR PROPORTIONS 


Style    Proportion    Line numbers 
V            16        14   29   55   68 
W             1         1    5   41   75 
X             1         2    6   42   77 
Y             1         3   39   44   79 
Z             1         4   40   46   80 


The investigators were instructed to enter the names of each family in a 
defined order, and to complete all lines of the form before commencing a new 
form. Actually these instructions were not always adhered to, 3} per cent. 
of the last lines (Nos. 40 and 80) being found to be blank. If the blanks extend 
over the earlier lines, which are not marked on the W-Z forms, but not as far 
as the lines marked on the V form, this will lead to a slight deficiency in the 
proportion in the sample of entries in line 1 and the other lines marked on 
the W-Z forms. This disturbance, however, is only very small, but any tendency 
of the investigators to alter the order in which the names were entered so as 
to secure a suitable person for supplementary questioning could easily give 
rise to more serious biases. 

The danger of this type of bias is always present in this method of sampling, 
and can only be overcome by the most rigorous training of observers, and 
the imposition of rules which determine uniquely the order in which names 
are entered on the list. 
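
The arrangement of marked lines in Table 4.11 can be examined numerically. The following sketch is illustrative only: the names are hypothetical, it assumes the five styles of form were issued in the proportions 16 : 1 : 1 : 1 : 1 shown in the table, and line 77 for style X is an assumption based on the rule that two lines on each side of the form were marked.

```python
from fractions import Fraction

LINES_PER_FORM = 80
LINES_PER_SIDE = 40

# style: (relative proportion of forms, marked line numbers)
# Line 77 on style X is assumed from the two-lines-per-side rule.
FORM_STYLES = {
    "V": (16, (14, 29, 55, 68)),
    "W": (1, (1, 5, 41, 75)),
    "X": (1, (2, 6, 42, 77)),
    "Y": (1, (3, 39, 44, 79)),
    "Z": (1, (4, 40, 46, 80)),
}


def overall_sampling_fraction(styles=FORM_STYLES):
    """Fraction of all form lines that are marked, averaged over the styles."""
    total_weight = sum(weight for weight, _ in styles.values())
    return sum(
        Fraction(weight, total_weight) * Fraction(len(lines), LINES_PER_FORM)
        for weight, lines in styles.values()
    )


def line_inclusion_probability(line, styles=FORM_STYLES):
    """Chance that a person entered on a given line falls on a marked line."""
    total_weight = sum(weight for weight, _ in styles.values())
    return sum(
        Fraction(weight, total_weight)
        for weight, lines in styles.values()
        if line in lines
    )


if __name__ == "__main__":
    print("overall fraction:", overall_sampling_fraction())
    for line in (1, 14, 40, 80, 10):
        print(f"line {line:2d}: {line_inclusion_probability(line)}")
```

The overall fraction comes out at 1/20, but individual lines are far from equally likely to be marked, which is why the sample depends on the order of entry being fixed by rule and not by the investigator.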


4.12 Frames from lists of households or dwellings 


Lists of households or dwellings are frequently available from such 
sources as rating offices, electoral registers, etc. Frames based on such lists 
are in many ways preferable to frames based on lists of individuals. As already 
mentioned, in most surveys in which the information is collected by personal 
visit it is advantageous, and often essential, to collect information from all 
members of a household, in other words to use households as sampling units. 

Frames consisting of lists of dwellings also have a much greater degree 
of permanence, being unaffected by movements of the population. Such 
frames, if complete at the time of their construction, will only become incomplete 
to the extent that there is new building, or changes in the use of existing 
buildings. New building is necessarily a slow process, and the listing of new 
buildings usually presents no serious difficulties. 

Lists of households can generally be utilized to give a frame of dwellings 
by taking as the sampling units the dwellings occupied by the households at 
the time the frame was constructed. Certain special precautions are required 
to ensure the inclusion of dwellings which were unoccupied at the time the 
list was prepared. In a town in which the dwellings are arranged in streets 
and in which the list is also arranged by streets this presents no particular 
difficulty. What has been called the half-open interval can be used. The 
procedure is as follows. When drawing the sample, the dwelling unit appearing 
next in the list to the selected unit is recorded, and the field investigator is 
instructed to see if there is any other unit on the ground between these two units, 
and if so to include that unit in the sample. Thus the field investigator might 
receive the instruction to survey No. 9 in a certain street, with No. 13 as the next 
recorded unit (odds only). If on visiting No. 9 he finds that No. 9A and No. 11 
also exist, these are also surveyed. The even numbers between 9 and 13 are 
not included, since the instruction “odds only” indicates that they lie on the 
opposite side of the street. This procedure is clearly only possible if the list 
is arranged in an order which corresponds to some geographical pattern on 
the ground. If the list is not so arranged, incompleteness of the frame cannot 
be corrected by the use of the half-open interval or analogous procedure. In 
such cases the frame will have to be amended by other means, and complete 
rearrangement of the list in some geographical order may be necessary. 
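
The half-open interval procedure described above can be sketched as follows. The function name and the street data are hypothetical, and the sketch assumes that the units found on the ground can be put in the same geographical order as the list.

```python
def half_open_interval(ground_order, listed, selected):
    """Units to survey when `selected` is drawn from the list.

    ground_order: units as found on the ground, in the geographical order
                  followed by the list (e.g. one side of a street).
    listed:       units appearing in the frame.
    Returns the selected unit plus every unlisted unit met on the ground
    before the next listed unit is reached.
    """
    listed_set = set(listed)
    if selected not in listed_set:
        raise ValueError("the selected unit must come from the list")
    start = ground_order.index(selected)
    to_survey = [selected]
    for unit in ground_order[start + 1:]:
        if unit in listed_set:
            break                  # next listed unit: the interval is closed here
        to_survey.append(unit)     # unlisted unit falling inside the interval
    return to_survey


if __name__ == "__main__":
    # Odd-numbered side of a street; Nos. 9A and 11 are missing from the list.
    ground = ["7", "9", "9A", "11", "13", "15"]
    listed = ["7", "9", "13", "15"]
    print(half_open_interval(ground, listed, "9"))
```

Each unlisted unit belongs to exactly one interval, so it takes over the selection probability of the listed unit that precedes it.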

An example of the use of this type of frame is provided by some surveys 
carried out during the war in certain towns in the United Kingdom by the 
Ministry of Home Security, to investigate disturbances to the population on 
account of air raids. In the English towns electoral registers were used 
as frames, and in the Scottish towns rating lists were so used. The electoral 
registers consist of printed lists of voters arranged by streets in order of 
dwellings, all voters in one dwelling appearing together. Each dwelling therefore 
has as many entries as there are voters. Consequently, selection of entries in 
the list with equal probability will not give an equal probability of selection 
in the different dwellings. This could have been overcome by subdividing 
the list into dwellings, and basing the sampling on these dwellings. 

As the surveys had to be conducted at considerable speed, delay in selection 
of the sample was avoided by the device of examining every xth entry, and in- 
cluding the dwelling in the sample if the entry referred to the first listed member 
in the dwelling. This introduces a certain additional discrepancy between 
the working sampling fraction, 1/x, and the actual fraction of dwellings included 
in the sample, which introduces errors that are appreciable relative to the 
sampling errors for estimates of such quantities as numbers in the population 
obtained by multiplying the sample total by x. Such estimates were not the 
primary concern of these surveys, and consequently no adjustments were 
required. If necessary, errors arising from this cause could have been eliminated 
subsequently by ascertaining the ratio of the number of dwellings included 
in the sample to the number in the whole register, and treating this ratio as 
the true sampling fraction. 

In this survey the method of the half-open interval was used to deal with 
dwellings not included in the list, and was found to be quite satisfactory. 
Had there been new housing estates not covered by the register these would 
have had to be dealt with separately. 

In certain towns the separate flats of blocks of flats were not listed, and 
therefore presented a difficulty, since the blocks constituted very large 
units whose chance inclusion or exclusion would have materially increased 
the sampling error. The existence of blocks of flats was, however, always 
apparent from the large number of voters appearing under the same address. 
The blocks were therefore listed, together with other large institutions, by 
preliminary inspection of the register, and every xth flat was selected by visit 
to all the blocks in turn. 


4.13 Frames provided by town plans 


Town maps and plans provide a useful frame for the sampling of dwellings 
in built-up areas. In some cases there may be detailed maps showing the 
location of all dwellings, but in many cases only street plans, not showing 
any great amount of detail, will be available. 

Any town plan which gives an accurate representation of the streets will 
enable the town to be divided up into “blocks,” i.e. areas bounded by streets. 
Such a plan will, therefore, provide a frame for area sampling in which the 
units are blocks. A sample of dwellings can then be obtained by including 
all the dwellings in the selected blocks. In general, however, the variability 
between block and block is likely to be large even after careful stratification, 
since there is often considerable local segregation of different classes of the 
population. Consequently, two-stage sampling is in general advantageous, 
blocks being taken as the first-stage units and dwellings as the second-stage 
units. 

To obtain a two-stage sample in cases in which the map does not show the 
location of dwellings, it will be necessary to construct the second-stage frame 
for the blocks selected at the first stage by ground survey. This, however, 
is a much lighter task than the construction by ground survey of a frame for 
all blocks in the city, and can frequently be done in the course of the survey 
itself. 

In towns in which the natural blocks are of very unequal area, groupings 
of the smaller blocks or subdivision of the larger blocks should be performed, 
so as to reduce the within-strata inequalities in area. If little is known about 
[...] 
estimated numbers of dwellings. This was done in parts of the Greek population 
census described in Section 4.16. If such preliminary estimates are not 
available the best that can be done is to make a selection with probabilities 
proportional to area, but block areas are unlikely to be very closely correlated 
with the number of dwellings, even within strata. In either case the second- 
stage sampling fractions may be taken inversely proportional to the first-stage 
fractions, so as to give a constant overall sampling fraction. 
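
The relation between the two stages can be sketched as follows. The block sizes are hypothetical and the function name is an assumption for illustration; the sketch takes the first-stage probability of a block as proportional to its size measure (area or estimated dwellings) and derives the second-stage fraction that keeps the overall fraction constant.

```python
from fractions import Fraction


def two_stage_fractions(size_measures, n_first_stage, overall_fraction):
    """First- and second-stage fractions giving a constant overall fraction.

    size_measures:    {block: size measure used for selection}
    n_first_stage:    number of blocks to be selected
    overall_fraction: required overall sampling fraction of dwellings
    """
    total = sum(size_measures.values())
    plan = {}
    for block, size in size_measures.items():
        first = Fraction(n_first_stage * size, total)  # proportional to size
        if first > 1:
            raise ValueError(f"block {block} is too large for selection by size")
        second = overall_fraction / first              # inversely proportional
        if second > 1:
            raise ValueError(f"block {block} cannot supply enough dwellings")
        plan[block] = (first, second)
    return plan


if __name__ == "__main__":
    sizes = {"A": 40, "B": 60, "C": 100, "D": 200}     # hypothetical size measures
    for block, (first, second) in two_stage_fractions(sizes, 2, Fraction(1, 20)).items():
        print(block, "first stage", first, "second stage", second,
              "overall", first * second)
```

Whatever the size measure used, the product of the two fractions is the same for every block, which is the property required.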

Whether an elaborate procedure of this kind is needed depends not only 
on the accuracy required but also on whether estimates of total numbers are 
required from the survey. Since total numbers will be highly correlated with 
numbers of dwellings, prior supplementary information on these numbers 
for the different blocks, even if only rough, will be particularly effective in 
reducing the sampling variability of estimates of total numbers. They are 
not likely to have such large effects on estimates of the proportions of the 
population falling in various categories. 

Sampling by streets is sometimes used instead of sampling by blocks. 
This is usually not so satisfactory as sampling by blocks, since each block repre- 
sents a clearly defined area, whereas if a street is taken, there is often doubt 
as to exactly what is to be included and what excluded: alleyways and 
courtyards having entrances from more than one street, and not shown on the street 
map, for example, present considerable difficulty if the sampling is by streets. 

Sampling by blocks is particularly valuable for surveys of towns in which 
all types of building have to be covered. Second-stage sampling of any or all 
of the different types of building can be adopted if required, by enumerating 
the different types for the selected blocks after the first-stage sample has 
been drawn. 


4.14 Frames provided by maps of rural areas 


The use of maps as a frame for the sampling of rural areas presents some- 
what different problems from those encountered in the sampling of towns 
by the aid of town plans. 

If accurate and detailed maps showing all or virtually all buildings are 
available, rectangular areas may be used as sampling units, the buildings falling 
in the selected areas being examined on the ground to see whether they are 
dwellings, with a further examination for unmapped dwellings. 

Sampling with probability proportional to the apparent number of dwellings 
indicated by the map is possible, but would involve counting the dwellings 
in all the rectangular areas. Consequently it is better, if preliminary work 
on the maps of this magnitude appears to be worth while, to divide the map 
into areas containing approximately equal numbers of dwellings, using natural 
boundaries as far as possible and paying particular attention to stratification. 

It may be noted here that the selection of a point at random on the map 
and selection of the dwelling unit nearest to this point for inclusion in the 
sample—a method which is sometimes used—is inadmissible, since a unit 
which is widely separated from other units will have a much greater chance 
of selection than one which is close to other units. 
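
The bias of the nearest-dwelling method can be seen in a small numerical illustration. The positions below are hypothetical, and the random point is replaced by a regular grid of points over a unit square so that the result is reproducible; the share of grid points nearest to a dwelling approximates its chance of selection.

```python
def nearest_dwelling_shares(dwellings, grid=100):
    """Approximate chance of selection of each dwelling under the
    'nearest dwelling to a random point' rule, on the unit square.

    dwellings: list of (x, y) positions.
    grid:      number of grid points along each side of the square.
    """
    counts = [0] * len(dwellings)
    for i in range(grid):
        x = (i + 0.5) / grid
        for j in range(grid):
            y = (j + 0.5) / grid
            nearest = min(
                range(len(dwellings)),
                key=lambda k: (dwellings[k][0] - x) ** 2 + (dwellings[k][1] - y) ** 2,
            )
            counts[nearest] += 1
    return [count / grid ** 2 for count in counts]


if __name__ == "__main__":
    # Three dwellings close together and one isolated dwelling.
    dwellings = [(0.20, 0.20), (0.25, 0.20), (0.20, 0.25), (0.80, 0.80)]
    for position, share in zip(dwellings, nearest_dwelling_shares(dwellings)):
        print(position, round(share, 3))
```

With equal probabilities each dwelling would have a share of one quarter; in this configuration the isolated dwelling receives a much larger share than any of the three clustered ones.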

With less detailed maps rectangular areas marked on the map will not be 
capable of being demarcated exactly on the ground. Natural features occurring 
on the maps must therefore be used as boundaries of the sampling units. This 
will necessarily give units of differing size. In particular, occasions will arise 
when it is impossible to subdivide a somewhat large area. In such cases, the 
area in question may be taken to represent two or more units. If any of these 
units are selected, a subdivision into two or more parts as alike as possible 
is made on the ground at the time of the survey, and the requisite number 
of parts selected by random choice. 

In most countries, even in rural areas, there will be a number of villages 
of varying sizes, which are best dealt with separately by some form of 
stratification and two-stage sampling, since if these are included in the area 
sample a high degree of variability will be introduced. The use of a variable 
sampling fraction at the first stage, a larger proportion of the larger villages 
being selected, will be advantageous. A compensating reduction in the second- 
stage sampling fraction can be made if desired. The boundaries of all villages 
will require careful demarcation, as otherwise there will be ambiguity as to 
what should be included in the area sample. 


4.15 Frames from lists of villages 


In undeveloped areas the available maps are not likely to be of sufficient 
accuracy for area sampling. Where the population is concentrated in villages, 
these usually form the best first-stage sampling units. A list of villages will 
then serve as a suitable frame. 

Even if the majority of the population is concentrated in villages there may 
be a residue located in the intervening countryside. If this residue owes 
allegiance to definite villages, the problem is relatively simple, since all that 
is required is the identification of the individuals belonging to the selected 
villages. This can normally be done by the head-men of these villages. 

If no such association exists, some form of area sample of the intervening 
areas may be necessary. If rough maps are available, suitable areas may be 
demarcated by tracks, rivers, etc. If no maps are available, some form of line 
sampling may be possible in open areas. 


If the country is not sufficiently open to be easily traversed, the construction 
[...] access included up to half-way to the next [...] 
village. Such a method will only be effective with a relatively simple track 
system such as is met with in forest areas: intermediate junction points, for 
example, present special problems. 


4.16 The 1946 population sample for Greece 


The 1946 population sample for Greece provides a good example of the 
way in which sampling methods of the types discussed in the previous sections 
can be used to obtain speedy census data from a sample covering the whole 
population of a country (Jessen et al., 1947, C).* 

The sample was taken by the Allied Mission for Observing the Greek 
Election, as part of the investigation of the accuracy of the electoral lists. A 
population sample was required in order to test for omissions from the electoral 
lists, and the opportunity was therefore taken of securing more general census 
data. The corresponding test for duplications and redundancies in the lists 
was made by examining the lists themselves and investigating a sample of 
names drawn from them. 

The frame for the first stage of the population sample was that given by 
the 1940 Population Census. This census gave returns for koinotetes, which 
are small communities or groups of villages, and demoi, which are towns and 
cities, usually with more than 10,000 population. Maps were available which 
showed the areas included in these koinotetes and demoi, and the names and 
location of all the populated centres. The koinotetes and demoi were used as 
sampling units at the first stage of the sampling. The units were stratified 
according to their population in 1940, and a variable sampling fraction was 
used. The actual scheme is shown in Table 4.16. Selection from within 
strata was systematic. 
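
The sampling ratios of Table 4.16 can be checked for consistency. The short sketch below uses the figures of that table; the variable names are hypothetical. It multiplies the ratio for the selection of places by the ratio for selection within a place, and compares the number of places in each size-class multiplied by its ratio with the number shown as sampled.

```python
from fractions import Fraction as F

# size-class code: (ratio for places, ratio within place, places in class, places in sample)
DESIGN = {
    1: (F(1, 100), F(1, 5), 2147, 20),
    2: (F(1, 50), F(1, 10), 2049, 40),
    3: (F(1, 20), F(1, 25), 1366, 70),
    4: (F(1, 5), F(1, 100), 54, 10),
    5: (F(1, 2), F(1, 250), 52, 26),
    6: (F(1, 1), F(1, 500), 22, 22),
}

# Overall sampling fraction in each size-class: product of the two ratios.
overall = {code: places * within for code, (places, within, _, _) in DESIGN.items()}

if __name__ == "__main__":
    for code, (places, within, in_class, in_sample) in DESIGN.items():
        expected = float(in_class * places)
        print(f"class {code}: overall {overall[code]}, "
              f"expected places {expected:.1f}, in sample {in_sample}")
```

The product is 1/500 in every size-class, so that the variable fraction at the first stage is exactly offset at the second.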

The sampling of the selected first-stage units was based on lists of house- 
holds within the area. These lists were either based on existing lists checked 
and brought up to date, or were specially prepared to show the location of the 
households on a map. 

For the sampling of towns an additional stage was used, a sample of 
blocks demarcated on an existing or a constructed street plan being first taken, 
with a further sample of houses from within the selected blocks. Sampling 
was sometimes with probability proportional to estimated numbers of house- 
holds, these estimates being obtained by a rapid cruise of the whole area, and 
sometimes with equal probability. 

The sampling fraction at the final stage was in all cases adjusted so as to 
give a constant overall sampling fraction. When blocks were sampled with 
probability proportional to estimated numbers of households, this required 
that the sampling fraction within the selected blocks should be taken as inversely 
proportional to the estimated number of households in the block. Thus the 
parish of Agios Panteleemous, which is given as an example, was initially sub- 
divided into 98 blocks. Before sampling, some of the smaller blocks were 
combined so as to give 65 combined blocks.† The total of the estimated
number of households was 966. It was decided to sample three blocks, which
were selected systematically by taking a sampling interval of 322 (= 966/3)
with a random starting point of 288, using sub-totals in the manner of
Example 3.2.c. This gave combined blocks containing 23, 18, and 13
estimated households respectively. Since a sampling fraction of 1/100 within
the parish was required, the estimated number of households in the sample
was 966/100 = 10 approximately. This number was divided approximately
equally between the three selected blocks (4, 3, 3), and the sampling intervals
were calculated by dividing the estimated numbers in the blocks by these
numbers, i.e. the intervals were taken as 23/4 = 6, etc. This procedure gives
the required constancy in the overall probability of selection. The actual
number of houses in the sample will of course differ from 10 if the estimated
numbers are in error: it is the overall probabilities of selection, not the
numbers of houses, that must be fixed.

* See also U.S. Dept. of State publ. 2522 (1946, D’) and Jessen et al. (1949, D’).

† This procedure is the same as that used in the sampling of Hertfordshire farms
by parishes, Section 3.11.


TABLE 4.16. GREEK POPULATION CENSUS: SUMMARY OF THE SAMPLE DESIGN

                                    Sampling ratios               Number of places
Size-                   Assumed     For         For selection     In        In
class  Population       average     selection   of names and      size-     sample
code   in 1940          population  of sample   households        class
                        in 1940     places      within a
                                                sample place

For koinotetes

1      0-499                   350  1/100       1/5                 2,147       20
2      500-999                 750  1/50        1/10                2,049       40
3      1,000-4,999           2,500  1/20        1/25                1,366       70
4      5,000 and over        7,000  1/5         1/100                  54       10

       TOTALS                                                       5,616      140

For demoi

5      Under 25,000         17,000  1/2         1/250                  52       26
6      25,000 and over           -  1/1         1/500                  22       22

       TOTALS                                                          74       48
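The arithmetic of the Agios Panteleemous example can be checked with a short calculation. The following Python sketch is illustrative only: the function names are assumptions introduced here, and only the figures 966, 322, 288, 23, 18 and 13 and the allocation (4, 3, 3) are taken from the text. It compares the whole-number intervals taken in the example with the intervals that would give an overall fraction of exactly 1/100.

```python
from fractions import Fraction

TOTAL_EST = 966                 # estimated households in the parish (from the text)
N_BLOCKS = 3                    # number of blocks to be selected
OVERALL = Fraction(1, 100)      # sampling fraction required within the parish


def selection_points(start, total=TOTAL_EST, n=N_BLOCKS):
    """Points on the cumulative sub-totals at which blocks are selected."""
    interval = total // n       # 966/3 = 322
    return [start + k * interval for k in range(n)]


def block_probability(est_households, total=TOTAL_EST, n=N_BLOCKS):
    """Chance that a block is selected, proportional to its estimated size."""
    return Fraction(n * est_households, total)


def exact_interval(est_households, overall=OVERALL):
    """Within-block interval that gives exactly the required overall fraction."""
    return block_probability(est_households) / overall


def overall_fraction(est_households, interval):
    """Overall chance of selection of a household for a given within-block interval."""
    return block_probability(est_households) / interval


if __name__ == "__main__":
    print(selection_points(288))
    for m, quota in zip((23, 18, 13), (4, 3, 3)):
        used = round(m / quota)         # whole-number interval, as in 23/4 = 6
        print(m, used, overall_fraction(m, used), float(exact_interval(m)))
```

With the whole-number intervals 6, 6 and 4 the overall fractions are 1/84, 3/322 and 13/1288, all in the neighbourhood of 1/100; the intervals giving exactly 1/100 would be about 7.1, 5.6 and 4.0.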
The ratio method of estimation was adopted, using the 1940 population
data as supplementary information. The actual method used was that appropriate 
to sampling without stratification with probability proportional to size of 
unit (Section 6.16), as this was considered to be the most accurate. There 
does, however, appear to be some danger of the introduction of bias by this 
method, and the unbiased method appropriate to a stratified sample with 
variable sampling fraction (Section 6.11) might have been preferable. 

The survey was very successful and achieved high accuracy, the standard 
error of the estimate of the total population being estimated to be ± 2.1 per
cent. The field work occupied 65 observer teams, each consisting of an observer, 
an interpreter and a driver, with a jeep, for three weeks. The entire sample 
and the computations were completed in 7 weeks. 


4.17 Master samples 

When a number of surveys covering the same population or aggregate 
of material are likely to be required, it is sometimes advantageous to construct 
a master sample, from which smaller samples can be drawn as required by means 
of a sub-sampling scheme. 

The use of a master sample has a number of advantages. It enables a more
accurate, complete and adequate frame to be constructed than could be justified
if the frame were only required for a single survey. It simplifies the selection
of samples, since in the sub-sampling only the material contained in the master
sample has to be subjected to the selection process. It enables supplementary
information to be obtained which is of value in improving the accuracy of the
various surveys. And it enables surveys on the same material to be so planned
that the same units are not selected an excessive number of times for different
surveys, a matter of some importance when the information is obtained by
response to questionnaires.

The most extensive and elaborate master sample so far constructed is the
master sample for agriculture of the United States of America. The
construction of this sample was undertaken by the Statistical Laboratory of Iowa State
College, in co-operation with the Bureau of Agricultural Economics and the 
Bureau of the Census (King and Jessen, 1945, G). 

The sampling units of this master sample consist of small areas covering 
the whole of the United States. The units have a mean area of about 2.5
square miles, but vary according to location and other circumstances, the
mean area per state ranging from 0.71 square miles to 108 square miles. They
were formed so as to contain on the average 4, 5 or 6 farms, depending on the
part of the country. One-eighteenth of all the areas were selected for the
master sample.

The whole of the land area of the United States was divided into three 
categories, called in the master sample “primary strata.” These primary
strata are (1) the incorporated stratum, (2) the unincorporated stratum, (3) the
open-country stratum. The incorporated stratum consists of incorporated
cities and towns and unincorporated places regarded as “urban” by the
Bureau of the Census. The unincorporated stratum consists of all named 
places outside the incorporated areas which have an estimated population 
of 100 or more, and all other areas which appear on the map and have a 
population density of 100 or more persons per square mile. 

The incorporated areas were defined by the corporate boundaries, of which 
the location could be obtained. The unincorporated areas were demarcated 
on the maps so as to give areas as compact as possible, while including every- 
thing that did not appear to be open country. Subject to this, the boundaries 
were chosen so as to be easily identified on the ground. Aerial photographs 
were used in some cases for this work. 

The general highway and transportation maps showed with varying degrees 
of accuracy the location of farms and other dwellings in the open-country 
areas and to some extent in the smaller unincorporated places, and these were 
therefore used to demarcate the actual sampling units of the open-country 
stratum. The procedure was as follows. The numbers of farms and non-farm
units were first counted in what are termed “count units” from the
map. A count unit consists of a unit defined by minor civil boundaries or 
natural boundaries, and in general included from 6 to 30 farms. These count 
units were numbered, and the number of farms and the total number of dwellings 
including farms were marked on the map. The number of sampling units into 
which each count unit was to be subdivided was also decided and noted on 
the map; in making this decision, consideration was given to the prevalence
of natural boundaries, etc. The data for each count unit were then recorded 
on punched cards, and cumulative totals of the farm count, the dwelling count, 
and the number of sampling units, were tabulated. These cumulative totals 
were used to determine the count units which contained selected units, a random 
number between 1 and 18 being chosen as a starting-point, and the count unit 
containing every 18th sampling unit being selected thereafter. The count 
units containing selected sampling units were next subdivided on the map 
into the specified number of sampling units, the subdivisions being so chosen 
that they could be located on the ground. The units so demarcated were 
then numbered or counted systematically and the appropriate sampling units 
selected. Existing aerial photographs were used extensively for the demarcation 
of boundaries. In cases in which there were no suitable natural boundaries 
on the maps or photographs two or more units were amalgamated, subdivision 
and random selection being subsequently made on the ground if either unit 
was selected. 
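The systematic selection from cumulative totals described above may be sketched as follows. This is a minimal Python illustration, not the procedure as actually carried out on punched cards for the master sample; the function name select_sampling_units and the counts in the usage note are hypothetical.

```python
import random
from bisect import bisect_left
from itertools import accumulate


def select_sampling_units(units_per_count_unit, step=18, start=None):
    """Locate every step-th sampling unit within the count units.

    units_per_count_unit : number of sampling units into which each count
                           unit is to be subdivided, in tabulation order
    step                 : sampling interval (18 for a one-in-eighteen sample)
    start                : starting-point between 1 and step; random if omitted
    Returns (count unit index, serial number within that count unit) pairs,
    with count units indexed from 0 and sampling units numbered from 1.
    """
    if start is None:
        start = random.randint(1, step)
    cumulative = list(accumulate(units_per_count_unit))
    total = cumulative[-1] if cumulative else 0
    picks = []
    for serial in range(start, total + 1, step):
        # First count unit whose cumulative total reaches the serial number.
        i = bisect_left(cumulative, serial)
        before = cumulative[i - 1] if i > 0 else 0
        picks.append((i, serial - before))
    return picks
```

For example, with hypothetical count units to be subdivided into 3, 5, 4, 6, 2, 7, 4 and 5 sampling units and a starting-point of 4, the call select_sampling_units([3, 5, 4, 6, 2, 7, 4, 5], start=4) picks the 4th and 22nd sampling units, which fall in the second and sixth count units. Only those two count units then need to be subdivided on the map.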

Somewhat different procedures, which need not be detailed here, were 
followed for the unincorporated and incorporated strata. For the incorporated 
stratum information was obtained from the Bureau of the Census on numbers 
of farms, etc. 

In its final form the master sample will provide an adequate master sample 
of both farms and population, and also of the land area of the whole of the 
United States. Because the sampling units consist of areas, the frame will
remain complete and adequate whatever changes occur in the course of time.
The supplementary information provided by the number of farms and number 
of dwelling units will naturally become progressively more inaccurate, but 
major changes are likely to take place only in limited areas, and the master 
sample will in the course of its use reveal the extent of these inaccuracies. 
There will, therefore, be no difficulty in revising the sample when it appears 
necessary for those areas of the country where extensive changes have occurred, 
and this will in no way invalidate the existing sample for the rest of the 
country. 

It will be seen that the construction of a master sample of this type is a major 
undertaking, and it should not be assumed that a master sample of the same 
type is necessarily expedient in other countries in which the conditions are 
different. Thus in the United Kingdom the 6-inch Ordnance Survey maps 
provide an excellent frame for land area surveys, and the register of farms 
which is maintained for the collection of agricultural statistics provides a very
complete frame of farms. If a master sample for agriculture is ever considered
necessary, its construction could be based on this register and on the associated
returns of farmers. The task of construction would therefore be very much
simpler than would be the case if no such register existed. On the other hand,
there is a need in the United Kingdom for an adequate master sample for
localized population surveys. This problem is discussed in the next section.


4.18 Localized population surveys 


As has already been indicated in Section 4.9, surveys are often required
which will give reasonably accurate estimates for the country as a whole, but
not for the separate small administrative districts. Such surveys have to be
concentrated in a few localities, particularly if they are to be carried out by
field investigators, since the amount of travelling would otherwise be excessive
and supervision difficult. They may therefore be termed localized surveys. A
multi-stage process must be used, the units at the first stage being administrative
districts or similar areas of such size that each selected unit is capable of being
covered by a single investigator or a small team of investigators.

The crux of the problem, therefore, consists of so planning the primary
stage of the sampling process that the sampling error at this stage is not excessive.
A secondary consideration, which must not be ignored, is that the within-strata
comparisons should be sufficiently numerous to furnish a reasonable
estimate of the sampling error at the first stage.

The use of stratification is obviously indicated. This stratification must
in the first instance serve to differentiate between urban and rural areas.
Consequently the country should be divided into large cities, into smaller
urban areas, and into rural areas, in a manner somewhat similar to that employed
in the master sample of the United States. The number of classes required
will depend on the character of the towns and rural areas. A variable sampling
fraction will be required in association with this stratification; for most surveys
it will probably be necessary to take all of the very large towns, but a proportion
only of the intermediate towns, a smaller proportion of the smaller towns,
and a still smaller proportion of the rural areas. Regional stratification of the
smaller towns and rural areas may also be adopted as far as possible in parallel
with this stratification.

These two types of stratification by themselves, however, are not likely 
to be entirely adequate for the urban areas, and some further form of stratifica- 
tion may be sought which will ensure (a) reasonably correct proportions of 
areas of different industrial types, and (b) reasonably correct proportions of the 
different social classes. 

The methods by which it may be possible to ensure this will vary greatly 
according to the nature of the country, the type of primary unit that is adopted, 
and the amount of information that is available on these primary units. Ad- 
ministrative areas are usually most suitable from the point of view of the amount 
of readily available information, but they are not always ideal from the sampling 
point of view. As far as the United Kingdom is concerned, administrative 
areas appear to be the only possible type of area which can be used without 
a great deal of preliminary work. They will probably prove reasonably satis- 
factory if the boroughs and urban districts associated with the large towns 
are treated as parts of these towns, and sampled fairly intensively. Thus, 
for example, the sampling of the various parts of London and of its satellite 
suburban towns should be considered as a special problem separate from that 
of the sampling of the smaller towns in other parts of the country. 

The second-stage sampling of the selected first-stage units is not likely 
to present any very serious problems. In the very large towns such as London, 
and in dispersed rural areas, two or more stages are likely to be required to 
avoid excessive travelling. Adjustment of the sampling fraction at the final 
stage to give equal overall sampling fractions is often advisable, since estimates
can then be rapidly and simply obtained. Provision at the final stage for a
proper rota of households to be included in the different samples, so as to avoid 
using the same household too frequently, is also of importance. 


Much further research work remains to be done before it can be said with 
certainty whether a sample of this nature covering the United Kingdom is 
likely to be satisfactory for all purposes, or whether samples having a different 
structure will be required for different purposes. The importance of 
investigating the possibility of obtaining such a sample is clear. Without it, 
localized sociological and economic surveys of the general population cannot
be carried out with any high, and at the same time ascertainable, degree of
accuracy.


4.19 The U.S. series of employment estimates 


An early example of a localized sample is that set up in the United States
to provide statistics on unemployment, employment
and the labour force (Frankel and Stock, 1942, F). The sample was modified
and improved in 1943 (Eckler, 1945a, F).

In the original sample, counties were used as the first-stage sampling units.
All the 3097 counties of the United States were classified and sampled as
follows:

            Total No.      Percentage of    No. of counties
            of counties    population       in sample

Cities             9             14                  9
Urban            447             50                 28
Rural           2641             36                 27

Total           3097            100                 64


The 9 city counties relate to the 5 largest cities; all these 9 counties were
included in the sample. The urban counties are those with 1930 populations
of 45,000 and over. In the urban and rural classes a triple stratification, each
of three strata, was adopted, the bases of the three stratifications being
population, administrative areas, and percentage unemployed. Divisions
between each of the main strata were so chosen that approximately equal
numbers of counties fell in each main stratum. There were thus 27 sub-strata
for both the urban and rural classes.* One county was selected from each
of these sub-strata at random, with one exception where two counties were
selected.

A two-stage process was used to sample the urban and rural areas
within the selected counties. The numbers of households to be selected from
the various urban and rural areas were allocated on the basis of the census
population figures for these areas. This led to the gradual development of a
differential bias between urban and rural areas, owing to a drift of population
away from the rural areas.

The results from within a single county were aggregated without any
weighting. The aggregates were then weighted according to the population
of the stratum from which they were obtained.

The within-county sample was changed every 4-6 months. This was a
compromise between having a constant group of households, which would
give most accurate estimates of monthly changes, and having new households
on each occasion so as to avoid repeated visits to the same household. It
introduced a certain discontinuity into the results, which has been avoided
in the modified sample by using a proper system of partial replacement of the
type described in Section 3.18.

In the modified sample, which included 68 first-stage units, allocation of
households on the basis of population figures was abandoned. Instead, small
areas were used as units at the second stage. This eliminates bias resulting
from population drift.

* It will be noted that the numbers of the counties in the different sub-strata
were not by any means equal.

Several other features were also introduced in order to improve the accuracy
of the results. The stratification was more detailed, selection of the primary 
units was with probability proportional to their populations, and the ratios 
between the numbers of households having certain contrasting characteristics, 
e.g. farm and non-farm, were adjusted in each selected first-stage unit to agree 
with the corresponding ratios in the stratum to which the unit belonged. This 
last procedure is not entirely free from danger of bias. In a unit with a relatively
small proportion of farm households, for example, those that do occur may 
be expected to be somewhat abnormal owing to proximity to non-farm areas, 
and such abnormal households will consequently receive excessive weight 
in the final results. 

In both these samples only a single unit was selected from each of the 
first-stage strata. Although this unquestionably increases accuracy by permitting 
the use of smaller strata, it has the consequence that no fully valid estimate 
of error is available. The best estimate is that obtained by combining the 
strata in pairs, and this is likely to be somewhat of an overestimate. 
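One common form of this paired-strata calculation can be sketched in Python as follows. It is an illustration under stated assumptions rather than the computation used for the U.S. series: the strata are taken to be of roughly equal size and already listed so that the two members of each pair are adjacent, each stratum supplies one estimated stratum total from its single selected unit, and the function name is hypothetical.

```python
def paired_strata_variance(stratum_totals):
    """Estimate the variance of the sum of the estimated stratum totals.

    stratum_totals : estimated totals, one per stratum, ordered so that
                     strata 1 and 2 form the first pair, 3 and 4 the second,
                     and so on.
    """
    if len(stratum_totals) % 2 != 0:
        raise ValueError("an even number of strata is needed for pairing")
    firsts = stratum_totals[0::2]
    seconds = stratum_totals[1::2]
    # Each pair contributes the squared difference of its two estimated totals.
    return sum((a - b) ** 2 for a, b in zip(firsts, seconds))
```

Any real difference between the two strata of a pair enters the squared difference together with the sampling error, which is why an estimate of this kind tends to be too large.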


4.20 Frames suitable for special classes of a human population 


Surveys of special classes taken from the whole of a human population 
are often required. If a general frame covering the whole population is 
available, it can be used for a survey of a special class by selecting a sample 
from the whole population, and rejecting those members which do not fall 
in the required class. If the frame itself does not contain the necessary 
information, this will necessitate surveying all units of the sample in order 
to find out which individuals are to be retained and which rejected. If the 
required class is only a small fraction of the whole population there will be 
a large proportion of rejects, and a disproportionate amount of work is there- 
fore required in such cases. 

Consequently, if a frame covering only the required class or classes is avail- 
able, this should be used in preference to a general frame. In surveys of the 
labour force, for example, it is often possible to utilize unemployment 
insurance registers and similar records. Such frames are often to a certain 
extent inadequate (all types of labour may not be included in an unemployment
insurance scheme, for example), but their greater convenience frequently
outweighs their defects. Occasionally it may be considered advisable to cover
the excluded classes with


4.21 Frames suitable for the survey of economic institutions


Surveys of economic institutions may be divided into two general classes:
those covering the whole or a large part of the commercial and industrial 
undertakings of a given town or country, and those covering a single type 
of undertaking or industry. 

In surveys of the former class the use of frames constructed from maps 
or plans is often feasible: thus, in a general survey of factories of a given town
a town plan may conveniently be used, with area sampling from this plan. 
Since, however, most commercial and industrial undertakings vary greatly 
in size, a variable sampling fraction is often required, a larger proportion of 
the large undertakings being selected. A map does not provide a suitable 
frame for this purpose. Even for general surveys, therefore, it is often advisable 
to use a special frame for the large undertakings, excluding these from the area 
sample, which is used to cover only the smaller undertakings. In a survey 
of the factories of a town, for example, there is usually no great difficulty in 
drawing up a list of the larger factories. If necessary a preliminary ground 
survey, with or without the aid of detailed maps, can be made. 

If a particular type of undertaking or industry requires to be surveyed, 
the use of any form of area sampling will generally be unsatisfactory unless 
the units are small and widely dispersed, as, for example, occurs with retail 
shops. Even here shopping centres will require differentiation from other 
areas. In other cases a list of all the units of the given type of undertaking 
or industry will form a very much more suitable frame. In order that a variable 
sample fraction may be used it is important that such a list should contain 
some indication of the size of the units. If no satisfactory frame of this type exists
it will often be worth while carrying out a complete census, simply for the
purpose of constructing a frame and collecting a few basic facts about the
given type of undertaking or industry. Such a census is usually more effective
if it is on a compulsory basis.

When a frame provided by a complete census is required for repeated 
surveys, the problem of keeping it up-to-date must be considered. This is 
usually best effected by keeping a register of the undertakings concerned, 
and making a regulation that requires all changes to be reported. For the 
purposes of sampling, the most important type of change which requires to 
be recorded is that of new entries. Failure to report other changes will merely
result in inaccuracies in the frame.


4.22 Market research and opinion surveys 


Market research includes not only investigation into consumer reactions 
to goods and services and to advertising campaigns, but also investigations 
into consumer needs. In the case of consumer reactions information is mainly 
required on opinions, while in investigations of consumer needs factual
information will also be required. Market research surveys can therefore be carried
out in the same manner as sociological surveys of the questionnaire type,
using the methods which have already been described. 

This is also true of other surveys of public opinion, but in many opinion 
surveys and also in certain types of investigation into consumer reactions, 
the requirements are somewhat different from those for sociological investiga- 
tions. Speed is often essential, and changes in the percentage of individuals 
holding a given opinion are frequently of more interest than the absolute value 
of the percentage holding that opinion at any one time. 

To meet these requirements, and to reduce costs to a minimum, what is 
known as the quota method of sampling has been developed. This method
is a variant of purposive selection. Interviewers are given definite quotas of 
people in different social classes, of different age-groups, etc., and are instructed 
to obtain the requisite number of interviews in each quota. Additional 
instructions, which are designed to prevent excessively unrepresentative 
selection within the allotted quotas, may also be given on mode of contact, 
etc. The interviews themselves are sometimes carried out by house-to-house 
visits, sometimes by interviewing people in the streets and other public places, 
and occasionally even by telephone. 

It is clear that, however accurately the quotas are fulfilled, such samples 
cannot be regarded as the equivalent of random samples. Consequently the 
danger of bias is always present, and the quota method must therefore be ruled 
out as a suitable method of investigation for precise enquiries in which unbiased
results are required. Moreover, if there is a change in conditions, a quota
sample which has previously adequately reproduced the characteristics of
the population may cease to do so. Consequently, the fact that a quota system
has consistently given reliable results over a period of years is no guarantee 
that it will also do so in the future. 


The striking failure of public opinion polls to predict the results of the
[...] to a much greater
extent than usual, and it may well be that, in spite of the quota system, the
samples were very deficient in factory workers and other trade union labour.
The mere fact that a quota system is designed to give the correct proportion 
of workers, or even of different classes of workers, does not necessarily ensure 
that those included are representative, as regards the way they vote, of the 
workers as a whole. Consequently the results may be biased in elections in 
which the different types of worker vote very differently. 

On the other hand, if used with skill the quota method may give sufficiently 
accurate results in simple enquiries where only general indications of the 
opinions held are required. If the samples are taken in the same manner on 
different occasions, and circumstances remain broadly the same, they may
provide a not-too-inaccurate measure of changes of opinion.

Apart from the problem of obtaining a representative sample, there is the
inherent difficulty in opinion surveys that an individual’s opinion on a given 
subject is frequently both ill-defined and liable to change. Moreover, on certain 
subjects the respondents may be unwilling to voice their true opinions. 
Opinions are also held with very different degrees of intensity, which there 
is no easy way of measuring. Much of the information provided by public 
opinion polls is therefore of doubtful significance. 


4.23 Frames for agricultural censuses and surveys 


Agricultural censuses and surveys can be carried out in collaboration with 
the farmers, or in certain circumstances by direct observation without contacting 
the farmers. The latter method is in general only applicable to surveys of 
agricultural crops, and then only if all particulars required are ascertainable 
by inspection. For censuses and surveys of livestock the collaboration of the 
farmer is usually necessary, the essential difference being that livestock is 
mobile whereas crops are immobile. Collaboration is also obviously required 
if information relating to the farm as a whole is needed. In many countries 
contact with the farmer is advisable even for crop surveys, because exception 
may well be taken to the examination of a crop without the farmer’s permission. 

If a census or survey is to be conducted by contacting the farmer, the farm 
will usually form the sampling unit at some stage of the sampling process. A 
frame covering farms will therefore be required. Such frames are provided 
either by lists of farms, or by some form of area sampling which serves to locate 
the farmhouses. Frames based on maps, etc., which are suitable for the sampling 
of human populations in rural areas (Section 4.14) are equally suitable for 
the sampling of farms. 

If contact with the farmer is not necessary maps can be used directly as a 
frame for crop surveys. Their use for this purpose is discussed in the next 
section. Even in this case, however, farms may well provide the best available 
frame.

In crop surveys the natural unit for many purposes is the field and not 
the farm. In cases where it appears advisable to obtain information for some 
only of the fields of a farm under a given crop, a further stage will have to be
introduced into the sampling process. This inevitably results in a somewhat 
complicated sampling structure with different sampling fractions for the different 
parts of the sample, which in turn introduces complications into the analysis 
of the results, at least if unbiased estimates are required. 

An example of this type of survey is provided by the Survey of Fertilizer 
Practice, carried out in various counties of England and Wales from 1942 
onwards (Yates et al., 1944, G). The objects of this survey are to determine 
the way in which farmers manure the different crops, and the relation of this 
manurial practice to the fertilizer requirements of the soil, in so far as these 
can be determined by the current methods of chemical soil analysis. 

The method of selecting the samples is as follows. For each county a
systematic sample of farms is selected from the Ministry of Agriculture’s
addressograph list, maintained for the purpose of collecting the agricultural 
statistics on crop acreages and livestock. This list is arranged alphabetically 
by farmer’s name and parish, and shows the total acreages (crops and grass) 
of each farm. A variable sampling fraction is used, with three size-groups, 
about 100 farms being selected from an average county. Larger samples 
are taken from counties which can be subdivided into districts containing 
different types of farming.
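A systematic selection with a variable sampling fraction of this kind may be sketched as follows. The Python below is illustrative only: the limits of the three size-groups, the sampling intervals, the starting-points and the function names are hypothetical, and are not those used in the Survey of Fertilizer Practice.

```python
def size_group(acreage):
    """Hypothetical size-groups; the limits used in the survey are not given here."""
    if acreage < 50:
        return "small"
    if acreage < 150:
        return "medium"
    return "large"


def systematic_by_group(farms, intervals, starts):
    """Take every k-th farm of each size-group, working through the list in order.

    farms     : iterable of (name, acreage) in the order of the list
    intervals : {group: k}, the sampling interval for the group
    starts    : {group: r}, the starting-point for the group, 1 <= r <= k
    Returns (name, group, k) for each selected farm; k is its raising factor.
    """
    seen = {group: 0 for group in intervals}
    sample = []
    for name, acreage in farms:
        group = size_group(acreage)
        seen[group] += 1
        k, r = intervals[group], starts[group]
        if seen[group] >= r and (seen[group] - r) % k == 0:
            sample.append((name, group, k))
    return sample
```

Since the list shows the total acreage of each farm, the size-group can be read off as the list is worked through, and a separate count kept for each group. The interval attached to each selected farm is the factor by which its values would be raised in forming estimates.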

Each selected farm is visited by a field investigator, who is a member of 
the Provincial Advisory Staff. All the fields of the farm are listed in consultation 
with the farmer according to their crops, and also according to whether they 
have been recently ploughed out from grass (new and old arable). In the 
earlier surveys one field of each crop was selected at random from all the 
old-arable fields, and similarly for all the new-arable fields. One permanent 
grass field was also selected. In the later surveys one field in three of each 
crop has been selected from each of these categories. From each group of 
selected fields one old-arable, one new-arable and one permanent grass field 
is selected at random, and soil samples taken for chemical analysis. 

For the selected fields information is obtained from the farmer on the 
cropping over the previous four years, and the amounts and chemical composi- 
tion of the fertilizers, farmyard manure and lime applied in each year of this 
period. In some of the later surveys only a single year has been covered. When 
necessary the fertilizer merchants are consulted in order to obtain information 
on the chemical composition of the fertilizers. 

The methods of analysis adopted in this survey are illustrated in Example
6.19.


4.24 Use of maps as frames in agricultural surveys


If accurate large-scale maps showing the field boundaries are available, 
the point method of sampling is very suitable for crop surveys in which contact 
with the farmer is not necessary. The fields will then act as sampling units, 
and selection will be with probability proportional to size. Provided the whole 
of a selected field is under a single crop, all that is necessary for acreage estimates 
is to ascertain the crop, no determination of area being required (Section 3.9). 
If more than one crop is being grown on a selected field, the proportions of 
the area under the different crops must be determined, but eye estimates will 
usually be adequate for this purpose. 
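The estimation of acreages from a point sample can be sketched in Python as follows. The sketch assumes, beyond what is stated above, that the total area covered by the points is known; the function and variable names are hypothetical. Each observation records, for the field in which a point falls, the proportions of that field under the different crops (a field wholly under one crop having proportion 1 for that crop).

```python
from collections import defaultdict


def acreages_from_points(observations, total_area):
    """Estimate crop acreages from the fields located by sample points.

    observations : one dict per sample point, {crop: proportion of the field}
    total_area   : total area covered by the points, in acres
    """
    n = len(observations)
    if n == 0:
        raise ValueError("at least one sample point is required")
    share = defaultdict(float)
    for proportions in observations:
        for crop, p in proportions.items():
            share[crop] += p
    # Fields are selected with probability proportional to size, so the mean
    # proportion over the points estimates the fraction of the area under the crop.
    return {crop: total_area * s / n for crop, s in share.items()}
```

No field areas enter the calculation: only the crop, or the eye-estimated proportions of the crops, need be recorded at each point.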

In this type of work two-stage sampling will often be advisable in order
to save travelling, and also to avoid having to handle an excessive number of
maps. Thus in the United Kingdom the 6-inch Ordnance Survey quarter-sheets
(3 miles x 2 miles) might provide suitable first-stage units, a fairly
dense grid of points being taken over the selected sheets.

If selection with equal probability of irregularly-shaped areas such as
fields is required, these areas must each be defined by a single point, such
as the most northerly point of the area. The map is then divided into sampling
units consisting of rectangular areas, a number of which are selected with 
equal probability, the fields with defining points in the selected areas being 
included in the sample. Only the selected rectangular areas need be 
demarcated. If more convenient, circular areas whose centres are located 
at random (or systematically) may be used in preference to rectangular 
areas. Rectangular areas have the formal advantage that the whole of the area 
is included once and once only in the aggregate of sampling units, but this is 
not of great practical importance. 

It should also be noted that with this method of selection the sampling 
units consist of groups of fields whose defining points are included in a single 
rectangular area, and not the fields themselves. Rectangular areas which 
contain no defining points must be counted as units of zero area. It can be 
shown that when the rectangular areas are small and mostly contain one or 
no defining point, the sampling errors of estimates of crop acreages are greater 
with this method of sampling than with the point method. The point method 
is therefore preferable under these circumstances. 
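
The defining-point procedure can be illustrated as follows. This is a minimal sketch, assuming a map divided into equal rectangular cells indexed by column and row. The field records, cell dimensions and function names are hypothetical. Selected cells containing no defining point contribute zero, as the text requires.

```python
def cell_of(point, cell_width, cell_height):
    """Return the (column, row) index of the rectangular cell holding a point."""
    x, y = point
    return (int(x // cell_width), int(y // cell_height))


def estimate_total(fields, selected_cells, n_cells_total, cell_width, cell_height):
    """Estimate a total (e.g. crop area) from equal-probability cells.

    fields: list of dicts with 'defining_point' (x, y) and 'area'.
    selected_cells: cell indices chosen with equal probability.
    A field belongs to the cell containing its defining point, so each
    field is counted in exactly one cell even if it straddles several.
    """
    chosen = set(selected_cells)
    if not chosen:
        raise ValueError("at least one cell must be selected")
    per_cell = {cell: 0.0 for cell in chosen}  # empty cells stay at zero
    for field in fields:
        cell = cell_of(field["defining_point"], cell_width, cell_height)
        if cell in chosen:
            per_cell[cell] += field["area"]
    # Raise the mean per selected cell to the whole map.
    return n_cells_total * sum(per_cell.values()) / len(chosen)


fields = [
    {"defining_point": (3.0, 4.0), "area": 5.0},
    {"defining_point": (12.0, 15.0), "area": 7.0},
    {"defining_point": (15.0, 2.0), "area": 9.0},
]
total = estimate_total(fields, [(0, 0), (1, 1)], n_cells_total=4,
                       cell_width=10.0, cell_height=10.0)
```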

On the other hand, if the rectangular areas are large relative to the sampled 
fields—as will be the case, for example, if whole sheets of a map are surveyed— 
the use of defining points in this manner saves splitting fields which are cut 
by the map boundaries. Some slight additional variance will be introduced 
unless the total area of all fields is determined and used as supplementary 
information. 

Maps have been extensively used as frames for the estimation of the 
acreages of crops in surveys conducted by the Calcutta Institute of Statistics 
(Mahalanobis, 1944, A; 1946, A; 1940, H; 1945, H; 1946, H). The 
method followed is to demarcate square areas located at random on the maps, 
and to survey all fields covered in whole or in part by these areas. The areas 
of the whole and part fields are determined from the maps by measurement. 
This measurement of areas and their subsequent summation might be avoided 
by the use of point sampling: if each square area were replaced by a square 
pattern of 9 or 16 points, for example, it would appear that the loss of accuracy 
would be small (see Example §.11.c). 


4.25 The 1942 Census of Woodlands 

The 1942 Census of Woodlands covering England and Wales provides an 
example of the use of maps as a frame. The object of the survey was primarily 
to determine the volumes of standing timber of various types in the country, 
and their broad regional location, in order to estimate the amount of available 
home-grown timber and to plan its utilization. 

The census was initially planned to be taken in two parts, each consisting 
of 5 per cent. of the total land area. The sampling units were 6-inch Ordnance 
Survey quarter-sheets (3 x 2 miles), systematically located on two inter- 
penetrating 12 x 10-mile rectangular grids, one for each part. All areas of 
woodland of over 5 acres on the selected quarter-sheets, and 1 in 5 of the areas 
under 5 acres, were surveyed. Areas of woodland cut by the boundary of the 
map were surveyed if their southernmost point was included in the selected 
map, areas subdivided by rides marked on the map being treated as separate 
areas for this purpose. The land areas covered by the selected maps were 
also inspected in the course of the survey to determine any new plantings since 
the map was last revised, thus correcting for any incompleteness in the 
frame. 
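
The effect of surveying only 1 in 5 of the small woods can be shown by a short sketch. It assumes simple raising of the sub-sampled areas by the inverse of the sampling rate, which is an illustrative choice and not necessarily the computation used in the census (the final estimates, as described below, were built from volumes per acre and independently determined total areas). The function name and figures are hypothetical.

```python
def sheet_woodland_area(large_areas, sampled_small_areas, small_rate=5):
    """Estimate the woodland area on one quarter-sheet.

    large_areas: areas (acres) of all woods over 5 acres, surveyed in full.
    sampled_small_areas: areas of the 1-in-`small_rate` sample of woods
    under 5 acres.  Each sampled small wood stands for `small_rate` woods.
    """
    return sum(large_areas) + small_rate * sum(sampled_small_areas)


# Hypothetical sheet: two large woods, and two small woods in the sample.
area = sheet_woodland_area([12.0, 30.0], [2.0, 4.0])
```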

The woodland areas were divided by inspection on the ground into “ stands,” 
each of which represented a homogeneous area of woodland. The boundaries 
of these stands were demarcated on the maps so that their areas could be 
determined, and a representative plot was chosen from each stand on which 
all or a sample of the trees were measured. Representative plots were used 
instead of random plots because reasonably accurate volume figures for the 
individual stands were required. Control of the bias introduced by the use 
of representative plots was effected by determining the quantities of converted 
timber actually obtained from surveyed stands felled in the course of ordinary 
forestry operations. This procedure served also as a check against any errors 
in the assumed wastages on conversion. A further control by the measurement 
of randomly selected plots on a sub-sample of stands was also planned, but 
was not in fact carried out. 

The total area of woodland was determined independently from the areas 
coloured green on the 1-inch Ordnance Survey sheets (see Example 7.18). 
Errors in the 1-inch sheets were allowed for by comparing the selected 6-inch 
sheets with the 1-inch sheets after survey. The final estimates of volume 
were calculated from the volumes per acre determined from the surveyed 
areas and the total areas determined as above. 

A first estimate of volumes was required within six months of the decision 
to undertake the survey, and it was thought that with the teams available the 
first part of the survey could be completed and the estimates prepared within 
this time. Before field work commenced, however, it became apparent that 
the original programme could not be adhered to. Each selected quarter-sheet 
was therefore roughly divided into two halves as similar as possible, and one 
half of each sheet was selected at random for the first part of the survey, giving 
a 2½ per cent. sample of all woodlands in the country. Subsequent calculations 
of the sampling errors showed that this 2½ per cent. sample was quite adequate 
for the determination of general policy, which was the first objective of the 
survey. 

The survey was then completed in two further parts, first the remaining 
halves of the first set of quarter-sheets, and secondly the other set of quarter- 
sheets. The three parts therefore consisted of 2½ per cent., 2½ per cent. and 
5 per cent. respectively of all the woodlands of the country. In addition certain 
heavily wooded areas were completely surveyed. 
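
The sampling fractions quoted in this section follow directly from the dimensions of the quarter-sheets and of the grids. The short check below is offered only as an arithmetic illustration; the function and variable names are hypothetical.

```python
from fractions import Fraction


def sampling_fraction(unit_w, unit_h, grid_w, grid_h):
    """Fraction of the land covered when one unit_w x unit_h sheet is
    taken in every grid_w x grid_h cell of a systematic grid."""
    return Fraction(unit_w * unit_h, grid_w * grid_h)


# One 3 x 2 mile quarter-sheet per 12 x 10 mile grid cell.
one_grid = sampling_fraction(3, 2, 12, 10)   # 5 per cent.
half_grid = one_grid / 2                     # one half of each sheet
parts = [half_grid, half_grid, one_grid]     # the three parts of the survey

for part in parts:
    print(float(part) * 100, "per cent.")
print("whole sample:", float(sum(parts)) * 100, "per cent.")
```

The first part is thus one quarter of the whole 10 per cent. sample.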

The history of this survey demonstrates the extreme flexibility of sampling 
surveys, and the way in which they can be made to yield preliminary results 
of ascertainable reliability. By the procedure of first surveying a properly 
selected quarter of the whole sample, it was possible to obtain the preliminary 
estimates in the required time in spite of unexpected delays in the commence- 
ment of the survey. 


4.26 Frames for undeveloped areas 


If no accurate maps are available, exact location of previously demarcated 
small sample areas on the ground will be impossible. Alternative methods 
must therefore be employed. 

For completely undeveloped areas such as natural forests the line method 
of sampling is very suitable, provided the terrain and vegetation is such that 
the lines can be followed on given compass bearings without an undue amount 
of deviation. Distances along the lines can be determined by some simple 
measuring device such as a rope, or even by pacing. Where volume measure- 
ments are required small areas can be demarcated at given distances along 
each line. 

Some frame for the location of the lines is necessary. This can often be 
provided by existing mapped roads or other tracks, but it is by no means im- 
possible to construct a secondary frame as the survey proceeds by the use 
of cross traverses, using any available tie-in points. Except where maps are 
to be constructed, no great accuracy in the location of the lines is required, 
since it is only necessary that they be located in an unbiased manner with a 
density which is the same for the different parts of the area, or, if not the same, 
is determinable. 

In areas in which a line on a fixed bearing cannot be followed, any attempt 
at complete and unbiased coverage must necessarily be very expensive. 
Often, however, a sufficiently unbiased sample of natural vegetation will be 
obtained by traversing existing tracks and taking sample areas at suitable 
intervals by offsets at right angles to the tracks. If a map of these tracks is not 
previously available it may be worth constructing one by rope and sound or 
similar rough surveying technique. 

Crop surveys in partially developed areas without adequate maps present 
somewhat different problems. If the cultivated areas are located in the neigh- 
bourhood of villages, a two-stage sampling process will probably be required, 
a sample of villages being taken at the first stage. Since the total area of 
cultivated land associated with a village is likely to be closely correlated with 
the population figures, these (if known) should be treated as supplementary 
information. If not known the feasibility of making a simultaneous population 
census should be considered, since information on cultivated areas will be 
of more value if it can be related to population figures. In this case the sampling 
may well be two-phase, a larger sample being taken for the determination 
of population. 

The survey of the cultivated areas associated with the selected villages 
will require the construction of second-stage frames. If the line or point 
method of sampling is practicable this is likely to be the simplest method of 
dealing with compact areas of cultivation. Outlying fields will in this case 
have to be enumerated and sampled separately. 


In many cases enumeration of all fields will be the only practicable method. 
The preparation of a sketch map will then be advisable. A certain percentage 
of the enumerated fields can be measured for area, and the crops determined 
if this has not been possible at the mapping stage. If the cropping is known, 
stratification by crop should be made before the selection of the sample for 
area measurements. A frame of this kind may remain serviceable, with some 
revision, over a number of years. It will also serve to locate the samples 
required in a crop estimation scheme. 


4.27 Use of aerial survey photographs 


When no maps are available the possibility of using aerial survey photo- 
graphs as a frame for agricultural and land utilization surveys should be borne 
in mind. Although it is unlikely to be practicable to make an aerial survey 
simply for the purpose of providing a frame for sample surveys, it is often 
possible to utilize a survey that has been undertaken or is contemplated for 
other purposes. 


Any aerial photographs covering the area are likely to provide an adequate 
frame, though the use of aerial survey photographs even for a frame is not 
as simple as it appears at first sight. The mere handling of the photographs 
covering any large area is a somewhat difficult task which demands an adequate 
and properly trained office staff. Moreover, aerial photographs are subject 
to variations of scale (and also distortion) due to tilt and changes of altitude 
of the aircraft. The stated scale is therefore not always correct, and the scale 
sometimes exhibits disconcerting variations even over different parts of the 
same mosaic. The precaution should therefore be taken of checking the scale 
by means of measurements on the ground in a sufficient number of instances 
to make certain that no important source of error is introduced. 


Various methods of sampling can be used in conjunction with aerial 
photographs. If crop acreages have to be determined, point sampling is 
suitable. After the points have been marked on the photographs the fields 
in which these points fall must be identified on the ground and the crops 
growing on them recorded. In order to avoid excessive travelling it will almost 
certainly be worth using a two-stage process, the units at the first stage being 
rectangular areas which can be demarcated on the photographs, with a number 
of points taken within each of the selected areas. 

If line sampling is required, the lines can first be demarcated on the 
photographs, and subsequently surveyed on the ground. In certain 
circumstances it may be possible to make the intercept measurements on the 
photographs, using the ground survey merely to determine the characteristics 
of the various intercepts. 

If areas such as fields, the boundaries of which are recognizable on the 
photographs, are to constitute the sampling units, they may be selected with 
probabilities proportional to their sizes by point sampling. If there are likely 
to be ambiguities in the definition of the boundaries the units should be 
demarcated before selection. 

If natural units such as farmhouses which depend on point locations are 
to be selected, small rectangular or circular areas may be used as sampling 
units in the same manner as in selection from a map. 

In certain cases aerial photographs may provide the necessary information 
without any ground survey work. It is usually possible, for example, to 
recognize cultivated areas on the photographs, and the total cultivated area 
may consequently be determined directly from the photographs. In certain 
cases it may even be possible to differentiate between the different crops. In 
these cases the total cultivated area and the proportions of the area under the 
different crops can be determined by sampling of the photographs, point or 
line sampling being used as convenient. If desired, adjustments for variations 
in scale can be made by varying the spacing of the points or lines. 

In some cases the differentiation between the different crops on the 
photographs may be only partial, or subject to error. In such cases a sub-sample 
of the points classified on the photographs can be re-classified by ground 
survey. The information provided by the photographic classification will then 
serve as supplementary information. By this procedure the amount of ground 
survey necessary may be very considerably reduced. The examination of 
stereo-pairs may be a considerable aid to the classification of certain types of 
area, particularly forest areas. 

If an aerial survey is specially undertaken for the purpose of a sample 
census or survey, it is possible to reduce the amount of photography by taking 
parallel strips of photographs separated by unphotographed areas, but aerial 
photographs taken in this manner will not be of much use for mapping purposes. 
If no map frame is available, a few cross-strips will have to be taken to provide 
links between the separate strips. Too much reliance must not be placed on 
the accuracy of the location of the strips unless special navigational aids are 
installed. 


4.28 Crop estimation 

The total yield of a crop can be regarded as the product of the acreage and 
the mean yield per acre. These two quantities may therefore be estimated 
separately. Estimates and forecasts of the mean yield per acre must of course 
be related to the conventions adopted in the estimation of acreage, particularly 
with regard to areas on which the crop has failed or [...]. 

The estimation of acreage has already been discussed in previous 
sections, and in this section we shall therefore mainly be concerned with the 
problem of the estimation of the mean yield per acre. 

There are a number of ways in which estimates of the mean yield per acre 
of a crop, or the total yield, may be obtained. These may be broadly classified 
as follows :— 


(1) Reports from crop-reporters, who, at or subsequent to harvest, make 
returns to a central authority of their estimates of the average yields 
of the crop in their own districts, these estimates being based in the 
main on general impressions, discussions with farmers, etc. 


(2) The harvesting of small sample areas of the crop immediately prior 
to the main harvest. 


(3) Eye estimates of the yields of a sample of fields, with subsequent 
calibration of these eye estimates by comparison with the actual yields 
of some at least of the sample fields. 


(4) Co-operation with the farmers at harvest time so that accurate yield 
figures may be obtained from a sample of fields as they are harvested. 

(5) Returns by farmers of the yields of their crops. 

(6) Market returns, export statistics, etc. 


If necessary, Methods (2) and (3) may be combined in a two-phase sampling 
scheme, eye estimates being taken from a comparatively large sample of fields, 
with crop-cutting samples from a smaller sub-sample of these fields. 
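
A two-phase scheme of this kind can be sketched with a regression adjustment. This is one common form of estimator for such a scheme, given here as an assumption rather than as the method prescribed in the text: the mean crop-cutting yield of the sub-sample is adjusted by the difference between the mean eye estimate of the large sample and that of the sub-sample. The function name and the figures are hypothetical.

```python
def two_phase_regression_estimate(eye_all, eye_sub, cut_sub):
    """Regression estimate of mean yield per acre from two-phase data.

    eye_all: eye estimates for every field in the large first-phase sample.
    eye_sub: eye estimates for the fields in the crop-cutting sub-sample.
    cut_sub: crop-cutting yields for the same sub-sample fields, in order.
    """
    n = len(eye_sub)
    if n < 2 or n != len(cut_sub):
        raise ValueError("sub-sample needs at least two paired observations")
    x_bar_sub = sum(eye_sub) / n
    y_bar_sub = sum(cut_sub) / n
    s_xx = sum((x - x_bar_sub) ** 2 for x in eye_sub)
    if s_xx == 0:
        raise ValueError("eye estimates in the sub-sample must vary")
    s_xy = sum((x - x_bar_sub) * (y - y_bar_sub)
               for x, y in zip(eye_sub, cut_sub))
    slope = s_xy / s_xx  # regression of cut yield on eye estimate
    x_bar_all = sum(eye_all) / len(eye_all)
    return y_bar_sub + slope * (x_bar_all - x_bar_sub)


# Hypothetical yields (cwt. per acre): the sub-sample happens to contain
# poorer fields than the large sample, and the adjustment allows for this.
estimate = two_phase_regression_estimate(
    eye_all=[10.0, 20.0, 30.0, 40.0],
    eye_sub=[10.0, 20.0, 30.0],
    cut_sub=[12.0, 22.0, 32.0],
)
```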

These various methods all have their advantages and disadvantages. 
Method (1), that of crop-reporters, is the one commonly adopted by countries 
with long-established and stable systems of agriculture. Its success depends 
on the ability of the individual crop-reporters to make accurate and unbiased 
estimates of the average yields of their districts. The method is not objective, 
and no assessment of its accuracy can be made unless independent estimates, 
provided by some other method of known or ascertainable accuracy, are 
available for comparison. Doubt is often cast on estimates provided by the 
method because of disagreement with market returns, etc., and their lack of 
objectivity makes it impossible to say which set of estimates is at fault. 

Even if crop-reporters are reasonably accurate on the average over a run 
of years, estimates for particular years or particular districts may be subject 
to considerable errors. There seems to be a general tendency, for instance, 
to underestimate yields in good years and overestimate them in bad years. 
The accuracy attained may also be very different for the different crops. 
Moreover, spurious long-term trends may be introduced through gradual 
changes in the standards of the reporters, and this considerably reduces the 
value of the estimates as a measure of the improvement or deterioration of the 
agriculture of a country. Any sudden change in an agricultural system, such 
as the introduction of new varieties, or the bringing into cultivation of new 
land, may introduce disturbances into previously satisfactory estimates. 

Method (2), the harvesting of small sample areas, is theoretically capable of 
providing a completely objective estimate of the mean yield per acre of the 
standing crop at harvest time ; it will not, of itself, provide any estimate of 
the losses at or subsequent to harvest. In practice, however, serious bias 
may arise in a number of ways if proper precautions are not taken. These 
sources of bias, and the practical details of the method, are discussed further 
in the next section. 

Method (3), that of eye estimates, has the advantage that on certain types 
of crop such estimates can be relatively rapidly made, and consequently a larger 
sample of fields can be visited in a given time. The difficulty of having to 
transport and thresh a large number of samples, which often arises with 
Method (2), is also avoided. Some results of a trial of this method on wheat 
are given in Example 6.15. The method is not suitable for root crops such 
as sugar beet and potatoes, since it is difficult to judge the yields from inspection 
of the tops. In such crops, however, there are no transport and threshing 
problems, since the samples can be weighed in the field. 

If calibration of the eye estimates from the farmers’ yields is not practicable, 
or if the calibration is found to vary substantially from year to year, a few 
specially-trained field workers can be used to take crop-cutting samples in 
order to calibrate the eye estimates of each investigator at the time of harvest. 

Methods (4) and (5) require the co-operation of the farmer. Method (4) 
differs from Method (5) in that in Method (4) the harvesting is done in the 
presence of an investigator, and if necessary with assistance, such as the 
provision of a threshing machine, whereas in Method (5) reliance is placed 
entirely on the farmer to provide accurate yield figures. Owing to delays of 
threshing, etc., Method (5) is not likely to provide estimates till some time 
after harvest. 

Estimates from market returns, export statistics, etc. (Method 6) provide 
a useful basis for comparison with estimates by other methods, but such returns 
will only exceptionally give an accurate estimate of the actual yields, since 
the amount of the crop passing through the market is likely to vary very 
considerably in different circumstances. 


In Methods (2) and (3), which require field investigators, the question must 
be considered whether the survey should cover the whole of the country or 
whether it should be confined to certain districts only, using a two-stage 
sampling process. If only an estimate of the yield of the whole country or 
of [...] districts is required, comparatively few fields will need to be sampled, 
and if a single-stage process for the selection of fields is used a widely 
dispersed sample will result. A two-stage process will avoid this difficulty at 
the cost of introducing a between-districts component of variation into the 
sampling error. 

It should be emphasized that crop estimation, though theoretically simple, 
presents many practical difficulties. The introduction of a satisfactory [...] 
the provision of objective estimates to check [...] estimates, requires continuous 
work over a number of years [...] of workers. [...] crop-estimation projects 
should therefore not [...] can be maintained. Nor should an existing method 
be [...] or disturbed until a better alternative has been evolved [...] for some 
time. If the old and new methods are run in parallel for a number 
of years it will be possible to assess the reliability of the old method and its 
degree of bias, a task which will be quite impossible if it is discontinued before 
adequate comparative data have been obtained. 


4.29 Estimation of yield by the harvesting of sample areas 


As mentioned in the previous section, the estimation of the mean yield 
per acre of an agricultural crop by the harvesting of small sample areas presents 
many practical difficulties, and the results may be biased in a number of ways 
if the proper precautions are not taken. 

Errors can occur through faulty selection of the fields, through faulty 
sampling of the selected fields, through failure to take samples from the fields 
at dates sufficiently near harvest, or through failure to sample some of the 
selected fields owing to their having been harvested before they were 
visited. 

If rigorous means of selection are employed there is no reason why the 
selection of the fields should be faulty. If, however, the cruise method is used, 
[...] in Section 3.15, the estimate will almost certainly be appreciably biased, 
though this bias may be reasonably constant from year to year if the same 
route is followed each year. On the other hand, the use of the cruise method 
[...] 

An alternative procedure which is sometimes used is to follow the prescribed 
route and take a sample from all or a given fraction of the fields that are actually 
being harvested. This, however, may introduce an additional component of 
bias, since, unless special precautions are taken, limitations of time will result 
in the inclusion of a greater proportion of the fields which are harvested very 
[...] such crops, however, it is usually [...] between the time of taking the 
sample and date of harvest; this latter date [...] In the potato crop, for example, 
investigation has shown that the weight of tops provides a fair indication of 
the amount of further growth that may be expected. 

The cruise method of sampling, the [...] 
economical in travel. Which method is most suitable will depend largely on 
local conditions, and must be the subject of local investigation. 

Bias in the estimation of the yields of the actual fields can arise from 
improper location of the samples and from cutting a larger area of the crop 
than the true unit area. An example of such bias has already been given in 
Section 2.5. 

Edge effects are also liable to give rise to bias, since in an irregularly shaped 
field it is impossible without a great deal of labour to locate samples properly 
at random over the whole of the area. The method described in Example 3.2.b 
is clearly impracticable, and no simple method of traversing the field has been 
devised which will give equal probability of selection over the whole field. 
In practice, however, a systematic method of selecting the sample is quite 
adequate. The important thing is to see that the location of the sample units 
is as objective as possible. 

The determination of the bias arising from headlands, lower yields at the 
edges of the field, and errors in estimation of the area of the field—the U.K. 
Ordnance Survey areas, for example, include farm roads, hedges and ditches— 
can be made if required by more rigorous supplementary observations on a 
small percentage of the fields. Often, however, the separate determination of 
these components of bias is of no great practical interest, since the losses at 
and after harvest will also affect the total amount of the crop that is finally 
available for consumption, and the total bias is best determined by comparison 
with the farmers' reported yields on a sub-sample of the fields, or by determining 
these yields in co-operation with the farmer. 

Two methods of locating the sample units have been found convenient 
in practice in this country. The first is to traverse the field diagonally from 
corner to corner, using one or both diagonals, and locating samples at equal 
intervals along these diagonal lines. The interval required can be calculated 
by pacing the diagonal or making an eye estimate of the number of paces. 
[...] estimates are of little consequence, since the exact number [...] 
sampling units [...]. Alternatively, if the crop is in rows the field 
can be traversed along the rows. The length of one end is paced from corner A 
to corner B, and a row entered at a distance of one-quarter of this length 
from corner [...]. The field is then traversed along this row to the other end of the 
field, and a return row is selected by the same procedure. A suitable number 
[...] each traverse in the same manner as in the case 
of a diagonal traverse. [...] method of sampling it is advisable to step 
[...] requisite number of paces have been taken, but the field workers must be 
thoroughly trained in this procedure. 
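
The diagonal traverse can be illustrated by a small calculation of the pacing interval. This is a sketch only: the rule of dividing the diagonal into one more interval than there are samples, so that no sample falls at a corner of the field, is an assumption made here and is not stated in the text. The function name and figures are hypothetical.

```python
def diagonal_sample_positions(diagonal_paces, n_samples):
    """Return the pace counts along a diagonal at which samples are cut.

    diagonal_paces: paced (or eye-estimated) length of the diagonal.
    n_samples: number of sample units wanted on this traverse.
    The interval is rounded down to a whole number of paces so that the
    field worker has a definite count to follow.
    """
    if n_samples < 1:
        raise ValueError("at least one sample is required")
    interval = diagonal_paces // (n_samples + 1)
    if interval < 1:
        raise ValueError("diagonal too short for this number of samples")
    return [interval * i for i in range(1, n_samples + 1)]


# A diagonal of about 210 paces with 6 samples gives a 30-pace interval.
positions = diagonal_sample_positions(210, 6)
```

Fixing the pace counts before entering the field keeps the location of the units independent of the appearance of the crop.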

A good deal of work has been done on the most suitable size and shape of 
the sampling units. In this country experimental tests have shown that 4-6 
contiguous quarter-metre row lengths are suitable for cereal crops, with 6-10 
units per field ; for potatoes 4 units each of 6 ft. of row are now being tested 
on a large scale. Mahalanobis (1946, A), working in India, has used three or 
four concentric circles of 2-8 ft. radius, each annular ring being harvested 
separately so as to provide a control of field workers—with ordinary workers 
a bias is regularly found with the smallest circle—and this gives a check that 
the samples have really been taken in the prescribed manner. 

The best size and shape of the sampling units depends very much on the 
nature of the crop and local conditions, such as type of field worker, whether 
the crop is sown or planted in rows or broadcast, variability within fields, 
available equipment for threshing and transport, etc. Local investigation should 
therefore always be undertaken if any extensive work is contemplated. On 
the other hand, it should be recognized that the sampling error of individual 
fields is usually small relative to the variation from field to field, and consequently 
the introduction of a crop-estimation scheme need not await the results of such 
investigation ; any reasonably efficient method will give satisfactory results 
provided bias is avoided. (See Section 8.12.) 
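
The remark that the within-field sampling error matters little can be illustrated with the usual two-stage approximation to the variance of the estimated mean. The sketch below assumes equal numbers of units per field and ignores finite-population corrections; the variance components are hypothetical figures chosen only to show the pattern.

```python
def variance_of_mean(between_var, within_var, n_fields, units_per_field):
    """Approximate variance of the mean yield per acre in two-stage sampling.

    between_var: variance of true field means (field to field).
    within_var: variance between sampling units within a field.
    """
    return between_var / n_fields + within_var / (n_fields * units_per_field)


base = variance_of_mean(16.0, 4.0, n_fields=100, units_per_field=4)
more_units = variance_of_mean(16.0, 4.0, n_fields=100, units_per_field=8)
more_fields = variance_of_mean(16.0, 4.0, n_fields=200, units_per_field=4)
```

With these figures, doubling the units per field lowers the variance only from 0.17 to 0.165, whereas doubling the number of fields halves it to 0.085.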


4.30 Crop forecasting 


The term forecast is here used to denote an estimate of the yield of a crop 
furnished at some date well before harvest. The term is sometimes used to 
indicate estimates made by crop reporters at or even shortly after harvest, 
since such estimates are usually subject to later revision in the light of information 
received from farmers. Such estimates, however, are better termed preliminary 
estimates, in contrast to the revised or final estimates. 

There is some confusion also between forecasts and estimates of acreages, 
forecasts of mean yields per acre, and forecasts of total yields. Once the crop 
is sown the determination of the acreage, apart from crop failures, is a matter 
of estimation and not of forecasting, and any forecast of the total yield is usually 
best presented in the form of an estimate of the total acreage and a forecast 
of the mean yield per acre. 

There are three main methods of crop forecasting. Forecasts can be 
provided by crop reporters, they can be based on meteorological data such as 
rainfall obtained prior to the date of the forecast, or they can be based on 
observations and physical measurements of the growing crop, alone or in 
conjunction with meteorological data. 

Meteorological data do not directly provide forecasts of the yields. If they 
are to be used as a basis of such forecasts, reliable data on both the yields and 
the meteorological events must be collected over a number of years, and the 
crop-weather relations evaluated. The same is true if observations and 
measurements on the growing crop are to be used. The evaluation of these 
relations requires the application of the method of statistical analysis known 
as “ regression analysis.” We shall not describe this method here, but it may 
be well to emphasize that its application is not entirely simple, and the advice 
of a mathematical statistician experienced in this type of work should therefore 
be sought. 

It must not be assumed that it will be possible to evolve a prediction formula 
which will give satisfactory results, even if accurate and extensive data 
are available. In the first place, the yield of a given crop is influenced by 
meteorological and other events up to and sometimes after harvest, and this 
may introduce too great a degree of uncertainty into yields predicted some 
months before harvest to make the prediction of any value. In the second place, 
although meteorological factors undoubtedly account for a good deal of the 
variation in crop yields, they are not by any means the only factors. Changes 
in variety, insect pests, plant diseases, exhaustion of the fertility of the soil, 
changes in the type of land under crop, changes in the amount of fertilizers, 
and many other factors may also exert a major influence. Thirdly, meteorological 
effects are often somewhat complex, and it may therefore be impossible to 
determine them from a set of data extending over a limited number of years ; 
owing to the similarity of weather conditions over large areas, data from a 
number of districts in any one year are only a partial substitute for data extending 
over a number of years. 

One advantage of using measurements of crop growth instead of relying 
wholly on meteorological observations is that the crop is thereby used as its 
own integrator of meteorological and other effects up to the time of the 
measurements. Frost and flood damage, for instance, are clearly better assessed, 
once they have occurred, by survey of the crop than by examination of 
meteorological records. The selection of the particular types of observations 
and measurements which are likely to give an adequate basis for forecasting 
is a problem on which further scientific research is required, particularly in 
the case of grain and other seed crops. In the case of root crops investigation 
has already shown that the amount of growth made by the tubers or roots, 
coupled with some measure of the extent to which the plant is still growing, 
e.g. weight of tops, are likely to give satisfactory results. 

Since the evolution of a satisfactory method of crop forecasting demands 
a knowledge of the actual yields over a period of years, an investigation of 
suitable methods can be combined with an objective crop-estimation scheme. 
Once the observations and physical measurements which are likely to give 
useful information have been decided, all that is necessary is to take these 
measurements on a sub-sample of the fields which will subsequently be selected 
for sampling at harvest. In the initial stages it may be better to carry out the 
observations on a special sample of fields, rather than on the more scattered 
sample which will be suitable for crop estimation proper. More intensive 
investigations can also be carried out on experimental plots on which different 
varieties are sown, and which are subject to different cultural treatments and 
sowing dates. Experimental plots by themselves, however, are not likely to 
provide all the information required for the evolution of a suitable forecasting 
scheme, since the variation from field to field in a given district is often quite 
large, and the inclusion of a number of fields in the district usually gives a 
much more adequate representation of the average meteorological effects in 
that district than will a single field. 

If observations and measurements are to be made on the growing crop, 
a sampling scheme will have to be devised in order that single plants or small 
areas may be selected for measurement. A method of selecting wheat shoots 
for height measurements, for example, is described in Section 2.4. The 
principles to be followed in the location of the sampling units in the fields 
or experimental plots are similar to those which operate in the selection of 
samples for yield estimates. 


4.31 Determination of the size of sample when the sample is fully 
random 


As has been indicated in Chapter 3, the size of sample required to achieve 
a given accuracy depends on the variability of the material and the extent 
to which it is possible to eliminate the different components of this variability 
from the sampling error. In this and the following section we will describe 
the procedure which is appropriate for determining the size of a random 
sample, and indicate the general relationship between the errors of a random 
sample and other types of sample. Detailed consideration of the more involved 
types of sampling must be deferred till Chapter 8, where the comparative 
accuracy of the various types of sampling is discussed. 

In the discussion of sample size we shall require the concept of standard 
error. As already explained in Section 3.7, the sampling standard error of 
an estimate is a measure of the average magnitude of the random sampling 
error to be expected in that estimate. It also provides an indication of the 
frequency with which errors of various magnitudes may be expected to occur 
(Section 7.3). In rough general terms, one-third of the actual sampling errors 
will be greater than the standard error, and one-twentieth will be greater than 
twice the standard error. 

In the case of a fully random sample from a large population the formula 
for the standard error of the estimate of the proportion of units of a given 
type, i.e. having a given attribute, is very simple.* If p is the proportion of 
units of the given type in the whole population, and q = 1 − p is the proportion 
not of the given type, the standard error of the proportion of units of the 
given type in a random sample of n units (which provides an estimate P of p) is 
given by 

$$\text{standard error of } P = \sqrt{\frac{pq}{n}}$$

* The adjustment for finite sampling, required when the population is not large 
relative to the sample, is given in Section 8.1. 

The formula holds unchanged if the proportions are replaced by percentages. 
Thus 

$$\text{standard error of } (P\%) = \sqrt{\frac{(p\%)(q\%)}{n}}$$

The full line of Fig. 7.4 provides a graphical representation of the standard 
errors given by this formula. If 20 per cent. of the units of the population 
are of the given type, for example, the standard error of the percentage of 
units in a sample of 100 is 

$$\sqrt{\frac{20 \times 80}{100}} = 4$$

which is the value given by the full line. Thus estimates in the range 
20 ± 4 per cent., i.e. between 16 and 24 per cent., will be obtained in two-thirds 
of all samples of 100 units, and estimates in the range 20 ± (2 × 4) per cent., 
i.e. between 12 and 28 per cent., will be obtained in nineteen-twentieths of all 
samples. If a sample of 1000 is taken, the standard error will be 1.26, and 
estimates between 18.7 and 21.3 per cent. will be obtained in two-thirds of 
the samples (Section 7.4). 
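
The arithmetic of this example can be checked with a few lines of Python. This is only an illustrative sketch: the function name is invented here, and it assumes a fully random sample from a population large enough for the finite-sampling adjustment to be ignored.

```python
import math


def standard_error_of_percentage(p_percent: float, n: int) -> float:
    """Standard error of a sample percentage for a fully random sample
    of n units (hypothetical helper; no finite-population adjustment)."""
    q_percent = 100.0 - p_percent
    return math.sqrt(p_percent * q_percent / n)


for n in (100, 1000):
    se = standard_error_of_percentage(20.0, n)
    # About two-thirds of samples give estimates within one standard
    # error of the true value, about nineteen-twentieths within two.
    one_se = (round(20.0 - se, 1), round(20.0 + se, 1))
    two_se = (round(20.0 - 2 * se, 1), round(20.0 + 2 * se, 1))
    print(n, round(se, 2), one_se, two_se)
```

For n = 100 the standard error is 4, and for n = 1000 it rounds to 1.26, in agreement with the figures quoted in the text.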

When dealing with estimates of quantities such as means and totals it is 
best for our present purpose to work in terms of the percentage standard 
errors. The percentage standard error of the estimate of a quantity is the 
standard error of the estimate expressed as a percentage of the true value of 
the quantity. The percentage standard error of the estimate of the total number 
of units having the given attribute in the population is the same as the 
percentage standard error of P. From the above formula we see that this 
percentage standard error is given by 

$$\text{percentage standard error} = 100\sqrt{\frac{q}{pn}} = 100\sqrt{\frac{(q\%)}{(p\%)\,n}}$$

The percentage standard errors given by this formula are shown by the dotted 
line in Fig. 7.4. 

In a population in which 20 per cent. of the units possess the given attribute, 
for instance, the percentage standard error with a sample of 100 is* 

$$100\sqrt{\frac{80}{20 \times 100}} = 20 \text{ per cent.}$$

If there are 10,000 units in the whole population, 2000 will possess the given 
attribute. The standard error of an estimate of this number from a sample 
of 100 will therefore be 2000 × 20/100 = 400, i.e. in two-thirds of the samples 
estimates between 1600 and 2400 will be obtained. 

* This result can also be obtained directly from the actual standard error of the 
percentage, which is 100 × 4/20, i.e. 20 per cent. 

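
As a check on the figures of this example, the following Python sketch evaluates the percentage standard error and the standard error of the estimated total. The function names are hypothetical and are introduced only for illustration; a fully random sample from a large population is assumed.

```python
import math


def percentage_standard_error(p: float, n: int) -> float:
    """Percentage standard error of an estimated proportion,
    100 * sqrt(q / (p * n)), with p given as a proportion."""
    q = 1.0 - p
    return 100.0 * math.sqrt(q / (p * n))


def standard_error_of_total(population_size: int, p: float, n: int) -> float:
    """Standard error of the estimated number of units with the attribute:
    the true total multiplied by the percentage standard error / 100."""
    true_total = population_size * p
    return true_total * percentage_standard_error(p, n) / 100.0


pse = percentage_standard_error(0.20, 100)
se_total = standard_error_of_total(10_000, 0.20, 100)
print(round(pse, 2), round(se_total, 1))
```

With p = 0.20 and n = 100 the sketch gives a percentage standard error of 20 and a standard error of 400 units on the total of 2000.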
The formula may be re-written so as to give the number n required for 
the sample when the required standard error of P or the required percentage 
standard error is known. We have 

$$n = \frac{pq}{(\text{required standard error of } P)^2} = \frac{10{,}000\,q}{p\,(\text{required percentage standard error})^2}$$

The proportions p, q and P may be replaced by the corresponding percentages 
without change. 

If, for example, we are sampling a population in which it is believed that 
about 20 per cent. of the units are of a given type, and it is required to determine 
this percentage with a standard error of 1 per cent. (i.e. a percentage standard 
error of 5 per cent.), we shall require to take a sample with a number of units 
given by 

$$n = \frac{20 \times 80}{1^2} = \frac{10{,}000 \times 80}{20 \times 5^2} = 1600$$
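
The two forms of the sample-size formula can be compared numerically. The sketch below is illustrative only; the function names are invented here, percentages are used throughout, and in practice the result would be rounded up to a whole number of units.

```python
def n_for_standard_error(p_percent: float, required_se_percent: float) -> float:
    """Units needed so that the sample percentage has the required
    standard error (both arguments expressed as percentages)."""
    q_percent = 100.0 - p_percent
    return p_percent * q_percent / required_se_percent ** 2


def n_for_percentage_standard_error(p_percent: float, required_pse: float) -> float:
    """Units needed so that the estimate has the required percentage
    standard error, n = 10,000 q / (p * PSE^2)."""
    q_percent = 100.0 - p_percent
    return 10_000.0 * q_percent / (p_percent * required_pse ** 2)


# A standard error of 1 per cent. on a true value of 20 per cent. is the
# same requirement as a percentage standard error of 5 per cent.
print(n_for_standard_error(20.0, 1.0))
print(n_for_percentage_standard_error(20.0, 5.0))
```

Both calls give 1600, as in the worked example.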


These formulæ hold only for a random sample in which the sampling units 
are the units for which the proportion having a given attribute requires to be 
estimated. In sampling a human population, for example, the sampling unit 
may be the household and not the individual. In this case the standard errors 


$$\text{percentage standard error} = \frac{\text{percentage standard deviation of a unit}}{\sqrt{n}}$$

$$n = \frac{(\text{percentage standard deviation of a unit})^2}{(\text{required percentage standard error})^2}$$

the sample to give a standard error of 5 per cent. in the estimate of the mean 
income per family is therefore 


For a standard error of 1 per cent. the number required would be 1050. 

A similar calculation for the wheat acreages of the random sample of 
Hertfordshire farms (Table 7.2), already described in Section 3.7, is given in 
Example 7.2.b. In this case s = √1351.4 = 36.8. The mean wheat acreage 
per farm is 18.6, and the percentage standard deviation is therefore 
100 × 36.8/18.6 = 198. The percentage standard error is here very large 
because there are a large number of farms growing little or no wheat. In order 
to determine the total wheat acreage of an area with a 5 per cent. standard 
error from a random sample of farms, therefore, we shall require 
198²/5² = 1570 farms. 
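
The wheat-acreage calculation can be reproduced as follows. This is a sketch with hypothetical function names; the inputs (a variance of 1351.4 and a mean of 18.6 acres per farm) are those quoted in the text. Note that the text rounds at each step, so the unrounded result differs slightly from the quoted 1570.

```python
import math


def percentage_standard_deviation(sd: float, mean: float) -> float:
    """Standard deviation of a unit expressed as a percentage of the mean."""
    return 100.0 * sd / mean


def n_required(pct_sd: float, required_pse: float) -> float:
    """Units needed in a fully random sample for a given percentage
    standard error: (percentage s.d. / required percentage s.e.)^2."""
    return (pct_sd / required_pse) ** 2


s = math.sqrt(1351.4)                          # about 36.8 acres
pct_sd = percentage_standard_deviation(s, 18.6)  # about 198 per cent.
print(round(s, 1), round(pct_sd), round(n_required(pct_sd, 5.0)))

# Using the rounded value of 198 per cent., as the text does:
print(round(n_required(198.0, 5.0)))
```

Either way the answer is of the order of 1560 to 1570 farms; the difference arises purely from rounding the intermediate values.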

From the above formulæ we see that the standard errors of estimates 
derived from random samples of different sizes taken from the same population 
are inversely proportional to the square roots of the numbers in the samples. 
Conversely, to reduce the standard errors of the results in a given ratio we 
require to increase the size of the sample by the square of the ratio. Thus, 
in order to halve the standard errors of the results we must multiply the size 
of the sample by 4. 
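
The square-root relation can be written out as a short worked equation. The symbols n_1, n_2 and s.e.(n) are notation introduced here for illustration and do not appear in the text; both samples are assumed to be fully random samples from the same large population.

```latex
\[
\frac{\mathrm{s.e.}(n_2)}{\mathrm{s.e.}(n_1)} = \sqrt{\frac{n_1}{n_2}},
\qquad\text{so that}\qquad
n_2 = n_1 \left( \frac{\mathrm{s.e.}(n_1)}{\mathrm{s.e.}(n_2)} \right)^{2}.
\]
% Halving the standard error: s.e.(n_1) / s.e.(n_2) = 2, hence n_2 = 4 n_1.
```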


4.32 Some general rules on size of sample 

From the above discussion it will be seen that the calculation of the size 
of sample required to attain a given accuracy is a relatively simple matter 
when a random sample is taken. With the more involved types of sampling 
the calculations are more complicated, and more must be known of the material 
that is being sampled. 

Calculation of the accuracy w[...] 
is, however, often a useful preliminary [...] 
required in the more involved types of [...] 
unit is under consideration, the reduction in number [...] 
more complicated types of sampling is [...] 
variability [...] by the imposition [...] 
stratification [...] supplementary information. It is frequently 
possible to form a rough idea of the likely reduction from a general knowledge 
of the characteristics of the material. Thus in a survey designed to determine 
crop acreages, using farms as sampling units, it is to be expected that stratification 
by size of farm and the use of a variable sampling fraction in conjunction with 
such stratification will each give considerable increase in accuracy over a random 
sample. This is confirmed by the results already given in Section 3.7. 

When sampling units of different types or of alternative sizes are under 
consideration, the situation is more complicated, as is shown by the results 
already presented in Section 3.11. This is true also of multi-stage sampling. 

The following general rules may be of value. Rules 1 to 5 are applicable 
to the case in which only one type of sampling unit is under consideration, 
rules 6 and 7 to the case in which more than one type of sampling unit is being 
considered. 


(1) The use of stratification, a variable sampling fraction or supplementary 
information may in general be expected to increase the accuracy. 
Consequently the calculation of the number of units required in the 
case of a random sample gives an upper limit to the number of units 
required in any reasonable form of sampling using the same sampling 
units. 


(2) Stratification will only increase the accuracy substantially if there are 
marked differences between the different strata. The increases are 
usually larger for quantitative characters than for qualitative characters, 
i.e. attributes (Table 3.7.b). 


(3) A variable sampling fraction can greatly increase the accuracy when the 
units vary greatly in size, or more generally in variability from stratum 
to stratum. Fractions which increase the accuracy for quantitative 
characters may reduce it for qualitative characters (Table 3.7.b). 


(4) The use of supplementary information can greatly increase the accuracy 
in appropriate cases, and often serves as an alternative to stratification 
(Table 3.7.b). 

(5) Since there must be at least one unit per stratum, more detailed 
stratification is possible with larger samples. In such circumstances 
the increase in accuracy with increasing size of sample will be more 
rapid than is indicated by the square-root law. Conversely, for samples 
of a given accuracy the advantage of stratification may be reduced by the 
fact that reduction in the size of the sample necessitates an increase 
in the size of the strata (Section 8.15). 

(6) If sampling units of type A consist of aggregates of sampling units of 
type B (e.g. households and individuals), the use of sampling units of 
type A in place of units of type B will usually result in lower accuracy 
for a given amount of material in the sample (Table 3.11.b and Example 
7.8.b). 

(7) If multi-stage sampling is used, more final-stage units will be required 
than will be the case with single-stage sampling of the final-stage units 
(Tables 3.7.b and 3.11.b). 


All the above rules are indicative only. The quantitative gains in accuracy 
or reduction in number of units required in any particular case must be evaluated 
by the methods described in Chapter 8. The final decision as to the type of 
sampling to be adopted necessarily depends on the relative accuracy of the 
various methods and their relative costs. 

It is advisable at the planning stage to consider as far as possible the form 
in which the results require to be presented. In more complicated surveys, 
particularly of the exploratory and research type, the results themselves will 
in part suggest the form in which they require to be presented, but in simple 
types of survey the form of presentation can often be laid down in considerable 
detail. This is a help in verifying that the sample is sufficiently large to cover 
the required domains of study adequately. 


4.33 Pilot and exploratory surveys 


From what has been said in the previous sections of this chapter it will 
be apparent that there are many points on which decisions can only properly 
be reached after preliminary investigations in the form of a pilot survey have 
been carried out. On material of which nothing is initially known, e.g. in 
surveys of undeveloped territory, a preliminary exploratory survey may be 
required before any proper pilot survey can be undertaken. In addition to 
providing general information, such an exploratory survey may be used to 
construct a first-stage frame. 

A pilot survey has two main objects: firstly, the provision of information 
on the various components of variability to which the material is subject, and 
secondly the development of field procedure, the testing of questionnaires and 
the training of investigators. A pilot survey may also provide data for the 
estimation of the various components of cost of the different operations involved 
in the survey, e.g. interview time, time of travel, etc. Knowledge of such 
costs is required not only as a basis for general estimates of cost, but also in 
order to determine what type and intensity of sampling will be most efficient. 

A further function of pilot surveys is to determine the most effective type 
and size of sampling unit. In a crop-cutting survey involving the harvesting 
of small areas, for instance, we may require to determine the best size and 
shape for these areas. In order to investigate the variability of different types 
and sizes of unit it is necessary to be able to form aggregates which represent 
the largest units which are of interest. Thus, if areas ranging from, say, 
J ft. x 1 ft. to 3 ft. x 3 ft. are under consideration, it is necessary, or at least 


preferable, to harvest randomly distributed areas of 3 ft. x 3 ft. in sections 
of } ft. x 1 ft. (Section 8.14). 

Pilot surveys will not normally be required for material on which there is 
considerable previous survey experience, since every survey provides information 
on the variability of the material surveyed, and this can often be used in the 
planning of further surveys. Thus the 1942 Census of Woodlands (Section 4.25) 
was planned on the basis of experience gained in the 1938-9 Census of 
Woodlands. Even in such cases, however, new questionnaires and new methods 
of observation and measurement should be tested on a more or less random 
sample of the material before being put into operation. 

The testing of the field procedure by means of a pilot survey is discussed 
in Chapter 5, and its planning needs no special comment. The planning of a 
pilot survey to provide relevant information on the various components of 
variation is rather more difficult. The finer points will be fully apparent after 
the estimation of efficiency has been discussed (Chapter 8), but a few fundamental 
points may be made here. 

At first sight it might be thought that a fully random sample would be 
a satisfactory form of sample for a pilot survey. This, however, is not necessarily 
the case. As a simple example we may consider the survey of material in which 
the use of a stratified sample is likely to be appropriate. In this case we shall 
require to determine the components of variation within strata. This can 
only be done from a fully random sample if the sample is sufficiently large 
relative to the number of strata for the majority of strata to contain at least 
two units. 

This difficulty can be overcome by adopting some form of multi-stage 
sampling, so that the whole of the pilot sample is concentrated in a few of 
the strata. The primary stage of this multi-stage process need not necessarily 
be very rigorous. Thus in a survey covering a human population concentrated 
in villages, it may be considerably more convenient to use certain villages for 
the pilot survey rather than others. Provided that sufficient is known of these 
villages to indicate that they are fairly typical there is no serious objection to 
their use. Similarly in area sampling it may be sufficient to confine the pilot 
survey to districts conveniently situated with regard to the main and regional 
headquarters, provided there is some assurance that the different types of 
district are properly represented. 

Within the towns or areas selected for the pilot survey a fairly intensive 
survey can be made. If necessary a further stage may be introduced into the 
sampling process. Thus the survey may be confined to a selection of blocks 
in a city, instead of the whole city being sparsely covered. By this means, 
it is possible to obtain data which cover selected areas with a density which 
is of the same order as that which will be adopted in the final survey. If this 
is done the various possible types and sizes of strata can be effectively 
investigated. 

The concentration of the pilot sample into selected areas should not be 
pushed to extremes. It is better to have adequate cover of a representative 
sample of the data than highly detailed cover of small and possibly non- 
representative sections of it. Detailed cover will in any case not be required 
if the material is such that very small strata are known to be impracticable 
or of no value. 


When the possibilities of multi-stage sampling have to be investigated, 
the problem of designing a pilot survey is more difficult. The component 
of variability that governs the samp[...] 
variability of first-stage units (Section 7[...] 
first-stage units are represented in the pi[...] 
be obtained. In surveys in which comparatively few first-stage units are used, 
such as localized surveys on human populations (Section 4.18), the prior 
determination of the expected accuracy from pilot survey data is likely to 
prove to be impracticable. Reliance will then have to be placed on previous 
experience or on estimates of error from previously available data. This, 
however, is not quite so serious as it appears at first sight, since multi-stage 
samples with relatively few first-stage units are in general used most frequently 
in surveys of the research and investigational type, or in surveys which are 
repeated at intervals. In such cases the first few surveys can be used to provide 
data on the various components of variation, and the design can be modified 
if necessary in the light of this information. 

Even if the first-stage sampling error cannot be determined by means of 
a pilot survey, such a survey can be made to furnish reliable information on 
the errors to be expected at the second and subsequent stages. This will enable 
the survey to be planned so that the necessary accuracy is obtained on 
comparisons between first-stage units, which frequently form important domains 
of study in surveys of this type. 

Elaborate pilot surveys are not likely to be worth while in small-scale 
surveys. It is usually better to proceed with the actual survey work, even if the 
design adopted is not as efficient as would be possible if a full-scale pilot survey 
were first undertaken. If a series of small-scale surveys of similar type have 
to be undertaken the earlier surveys will themselves act as pilot surveys for the 
later surveys, the design of which can be modified in the light of the experience 
gained. 


CHAPTER 5 


PROBLEMS ARISING IN THE EXECUTION AND 
ANALYSIS OF A SURVEY 


5.1 Types of problem 


The problems arising in the execution of sample censuses and surveys are 
for the most part similar to those encountered in complete censuses. We 
shall therefore not discuss these problems in detail, but merely draw attention 
to some of the points which are of particular importance when sampling is 
used. 

The various phases of the work subsequent to the planning stage may be 
broadly classified as follows :— 

(1) Setting up of the general administrative organization. 
(2) Design of forms. 
(3) Selection, training and supervision of the field investigators. 
(4) Control of the accuracy of the field work. 
(5) Arrangements for follow-up in the case of non-response. 
(6) Abstraction and coding of the information. 
(7) Statistical analysis. 
(8) Reporting. 

In certain types of survey no action may be required under some of these 
heads. In a survey conducted by postal questionnaire, for example, there will 
be no field investigators unless they are required to deal with cases of 
non-response. 


5.2 Administrative organization 


The administrative organization required will depend very much on the 
nature and scale of the census or survey, and on the area to be covered. 

The main field task for which an extensive administrative organization is 
required is the supervision of the investigators, or the carrying out of follow-up 
enquiries in cases in which there is no staff of investigators to undertake this 
work. The main administrative task at headquarters is the supervision of the 
computing and clerical staff engaged in the abstraction and analysis of the 
completed forms. 

Every opportunity should be taken to utilize existing administrative and 
office organizations. When the survey covers a large area, supervision from 
a central office is likely to be difficult and in such cases it is best to establish 
regional offices. Very frequently some existing organization can be used for 
this purpose. 

It is not necessary for the computing and clerical staff at headquarters to 
be administered by the same organization as the field staff. In many cases 
it is convenient to use some existing statistical organization to carry out the 
analysis, and to utilize some administrative organization with regional offices 
to supervise the field work. Often an organization can be employed which 
already has contact with the respondents, or which has on its staff individuals 
who are suitably qualified to act as field investigators. 


5.3 Design of forms 

Most careful attention should be given to the detailed design of the various 
forms that will be used in the course of the census or survey, especially the 
forms on which the observations and answers to questions are recorded. This 
applies also to the instructions and explanatory notes which accompany the 
forms. 

The content of the forms for the recording of the information is determined 
by the information that is required, and has already been discussed. They 
may be forms designed for completion by the recipients with little or no 
assistance, questionnaires which form the basis of interviews, or forms on 
which observations and measurements taken in the field are recorded by the 
field investigators. 

Each type of form presents its own difficulties of design. The simplest 
is that on which observations and physical measurements are recorded by the 
field investigators themselves. In this case, the chief points to observe are 
that the form is convenient to use, and that the results are set out in such a 
manner that they are convenient to abstract. Figures which have to be 
summed by the field investigators, for example, should be arranged vertically 
and not horizontally, as the investigators will not be using calculating machines. 

In surveys which involve observations and physical measurements it will 
almost always be necessary to supply field investigators with a separate set of 
instructions. Consequently there is no need for the form to carry its own full 
explanation, though it should of course be made as self-explanatory as possible. 
Experience has shown that instructions to field investigators should be very 
detailed, and should cover all possible points of uncertainty or ambiguity. 


Provision should also be made for revision and amendment as need arises, 
[...] set of instructions which are completely [...] 

[...] without assistance, very careful attention [...] 
both of the questions and explanatory notes, [...] 
mind of [...] to what is required. Detailed and lengthy explanations 
should be [...] as possible. Such explanations as have to be given 
should [...] in conjunction with the question to which they refer. 
The common practice of giving detailed explanatory notes on the back of a 
form is not very satisfactory, since it frequently results in the respondent 
filling in the whole or portions of the form without consulting these notes. 
Forms of this type should, if possible, carry a brief explanation of the reasons 
[...] has been given in the press and elsewhere it is 
unlikely that [...] will in fact have seen it. 

In [...] type designed for completion by field 
investigators the investigators must be instructed whether the questions are 
to be put in the exact form given, or whether they can be asked in a general 
form. As already stated, in most cases the general form is more suitable, but 
in questions on opinions, where different forms of wording may be expected 
to affect the answer, it may be necessary to adhere to an exact form. 

With the general form of question explanatory notes are often required 
in order to make clear to the investigators exactly what information is required. 
Such explanatory notes can either appear on the questionnaire itself or be given 
in a separate set of instructions. The latter course results in a much more 
compact form of questionnaire and is suitable when full-time investigators are 
used. The former course is more likely to ensure that all investigators are in 
fact aware of what is really required and is best when the investigators are 
carrying out the survey in the course of other duties. In a lengthy questionnaire 
this will necessitate the questionnaire being in the form of a booklet. Such 
questionnaires are more bulky and costly, and frequently entail more work 
in the coding of the results, but are nevertheless frequently preferable in these 
circumstances. 

Forms may be either printed or duplicated. Printing is much to be preferred 
as it results in much neater, clearer, and more compact forms. The ordinary 
type of duplicating paper is also not very suitable for writing on, particularly 
in ink. 

Small forms may be printed on cards instead of paper. Cards are often 
more convenient for field use, and in small surveys of which the results are 
analysed by hand the use of cards may save transcription before analysis. 
Alternatively Cope-Chat cards may be used (Section 5.10). 

Forms printed on paper may be made up in the form of blocks with card- 
board backs. This facilitates writing in the field. Alternatively they can be 
clipped on to a wooden board. If duplicate copies of the completed forms 
are required, provision should be made for carbon copies to be taken at the 
time the forms are filled in. 

Forms larger than foolscap should be avoided if possible. They are 
troublesome both to handle and to store. Forms of more than one sheet should 
also be avoided. It is usually better to use both sides of a sheet or card than 
to use two sheets. 

Forms should always be subjected to a preliminary trial in the field. Only 
in this way will minor faults be discovered. In the case of questionnaires this 
test is best arranged in two parts: 


(a) a trial by investigators who are fully experienced in questionnaire 
work, and who are conversant with the problems under investigation ; 


(b) a trial by investigators of the type that are to be employed in the 
survey. 


The first trial will serve to determine whether the questionnaire is in the 
form most suitable for eliciting the required information from the respondents, 
and the second trial will provide information on whether the questions and 
associated instructions are understood by and within the capability of the 
investigators. 


5.4 Special tests of questionnaires and investigators 


In certain cases it may be worth making rigorous tests of different forms 
of the same question to see whether there are any material differences in the 
answers received. Since the question cannot be put in both forms to the same 
respondent, this must be done by the use of interpenetrating samples, using 
the same investigator or investigators for both forms. In order to eliminate 
the effect of any progressive change in the investigators or the respondents the 
tests of the two forms should proceed simultaneously. In the same way the 
difference between two or more investigators using the same form of question 
can be tested. 
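
One simple way of forming such samples is sketched below. It assumes, which the text does not state, that a random division of an already selected random sample of respondents is an acceptable way of obtaining the interpenetrating samples; the function name and the respondent labels are hypothetical.

```python
import random


def split_interpenetrating(respondents, n_groups=2, seed=None):
    """Divide the selected respondents at random into n_groups
    sub-samples of (nearly) equal size, one per form of question."""
    rng = random.Random(seed)
    shuffled = list(respondents)
    rng.shuffle(shuffled)
    # Deal the shuffled respondents out in turn, like a pack of cards.
    return [shuffled[i::n_groups] for i in range(n_groups)]


form_p, form_q = split_interpenetrating(range(1, 21), n_groups=2, seed=1)
print(sorted(form_p))
print(sorted(form_q))
```

Both sub-samples would then be worked simultaneously by the same investigator or investigators, so that any difference in the answers can be attributed to the form of question.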

More elaborate and precise tests of differences resulting from different 
forms of the same question and different investigators can be carried out by 
using the methods developed in the design of experiments. Thus if two forms 
P and Q of a question and three investigators A, B and C require to be tested, 
groups or blocks of six respondents may be used. The blocks should be chosen 
in such a manner that the respondents within each block are as alike as possible, 
using any available prior information. The six question-investigator
combinations PA, QA, PB, QB, PC, QC are then assigned at random to the
respondents of each block. This design is technically known as a 2 × 3
factorial design in randomized blocks. By this device differences between forms
of question and between investigators are simultaneously tested. Information
is also obtained on what are known as the interactions between forms of question
and investigators, i.e. on whether the differences between the forms of question
are different for the different investigators, and vice versa. The grouping of
respondents into blocks ensures that errors due to differences between
respondents are eliminated as far as possible ; the randomization enables the
standard errors of the comparisons to be calculated by the methods appropriate
to the analysis of replicated experiments (see for example Fisher’s Design of
Experiments or Snedecor’s Statistical Methods).*
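The random assignment within blocks described above can be sketched in a few lines of Python. This is only an illustration: the respondent labels, the function name `assign_blocks` and the use of a seeded generator are assumptions made for the example and are not part of the original procedure.

```python
import random

FORMS = ["P", "Q"]                 # two forms of the question
INVESTIGATORS = ["A", "B", "C"]    # three investigators


def assign_blocks(blocks, seed=None):
    """Assign the six form-investigator combinations at random within each block.

    `blocks` is a list of blocks, each a list of six respondent labels
    (hypothetical identifiers). Returns a dict: respondent -> combination.
    """
    rng = random.Random(seed)
    combos = [f + i for i in INVESTIGATORS for f in FORMS]  # PA, QA, PB, QB, PC, QC
    plan = {}
    for block in blocks:
        if len(block) != len(combos):
            raise ValueError("each block must contain exactly six respondents")
        shuffled = combos[:]
        rng.shuffle(shuffled)      # a fresh, independent randomization per block
        for respondent, combo in zip(block, shuffled):
            plan[respondent] = combo
    return plan


# Three blocks of six respondents each (labels invented for the example).
blocks = [[f"block{b}-resp{r}" for r in range(1, 7)] for b in range(1, 4)]
plan = assign_blocks(blocks, seed=1)
```

Every block receives each of the six combinations exactly once, so comparisons between forms and between investigators are made within blocks of similar respondents.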

Investigations of this kind can be carried out in the course of an actual 
survey, but they are normally better undertaken as separate investigations or
as part of the pilot survey, since information on different forms of question
will be required at the planning stage, and it is usually undesirable to complicate
the field procedure of a large survey. Routine tests of differences between
different investigators may, however, be incorporated without undue
complication in the actual survey by means of interpenetrating samples.


* An interesting investigation of the differences between three groups of
interviewers (two belonging to professional organisations, one of students)
was carried out by the Department of Research Techniques of the London School
of Economics (Durbin and Stuart, 1951, D’ ; Booker and David, 1952, D’).
A factorial design was employed.


5.5 Selection, training and supervision of field investigators


Field investigators may be specially appointed, they may be members of
existing staffs appointed for other work but over whom authority can be
exercised, or they may be individuals asked to undertake the work on a voluntary
basis or for a small honorarium.

The problem of selection arises primarily in the case of investigators 
appointed specially for the work. In order to secure a suitable type of person 
preliminary tests should if possible be made of all applicants, and the early 
work of newly appointed investigators should be carefully watched and
supervised. In large-scale censuses and surveys, proper training courses should
be arranged. If a pilot survey is undertaken this provides a valuable 
opportunity for training, and every attempt should be made to build up the 
team of investigators at this stage rather than later, even if this involves a 
certain amount of additional expense. 

It is of the greatest importance that investigators, once they have been 
trained and are found suitable, remain in the job. Every effort must therefore 
be made to see that the pay is adequate, and that the work is made as attractive 
as possible. In the case of the interview type of survey, investigators are 
sometimes paid on piece rates at so much a completed questionnaire. This 
is in general unsatisfactory, since it tends to lead to skimped work and to 
irregularities such as substitution of one respondent for another. 

It should not be forgotten that field work of the interview type is very 
arduous and is found by almost all investigators to involve considerable mental 
strain. Hours of work are also likely to be irregular, since if excessive 
non-response is to be avoided some evening interviews are almost inevitable. 
Investigators should therefore not be expected to work excessively long hours, 
and should if possible be given a rest on other work from time to time. It is 
often advantageous to bring full-time investigators to headquarters at intervals 
and use them for office work such as abstraction and analysis of the results. 
This not only serves to provide a break from field work, but also enables them 
to gain a much better insight into the purposes of their work. 

Whatever the conditions of work and form of payment, there must be 
adequate field supervision. The supervisors should themselves undertake 
field work from time to time, so that they are in a position to appreciate the 
difficulties of the work, and should also contact the workers while they are 
actually in the field. Provision should be made for personal contacts not only 
between supervisors and the field investigators, but also between supervisors 
and the headquarters staff. In long-term surveys it is also often advantageous 
to arrange conferences of the investigators from time to time at which difficulties 
can be discussed and the whole progress of the survey reviewed.


5.6 Control of the accuracy of the field work 


The best assurance that the field work shall be accurate is that the 
investigators are thoroughly trained in their work, and are capable, conscientious 
and keen. Nevertheless it is important even with the best investigators to 
keep a close watch on the progress of the work. 

In certain cases, particularly in surveys involving observations and physical
measurements, it is possible to arrange a system of field checks by the
supervisors. These should preferably be carried out on a random sub-sample of
units, and should in any case be conducted in such a manner that the investigators 
cannot know which parts of their work will be checked. Checks of this type 
will not usually be possible in the interview type of survey, as it is clearly 
impracticable to ask for the same information twice from the same individual. 

A preliminary examination of the completed forms must be made as soon 
as possible after they are completed. In this way defective work, in so far as 
it reveals itself in the forms themselves, is brought to light immediately, and 
remedial action can be taken. If the census or survey is such that a large 
volume of work is turned in by each field investigator, and it is not considered 
necessary to give individual scrutiny to all the returns, a proper sample of the 
work of each investigator should be scrutinized as a routine matter. 

The investigators should themselves be instructed to carry out any simple 
numerical calculations that are required on the forms, and also to look through 
the forms before sending them in to see if they are in satisfactory order. On 
the other hand extensive revision of the forms should not be permitted. The
preparation of fair copies by the investigators is in general undesirable, since
it leads to copying errors and also makes any judgment on the quality of the
work more difficult. If fair copies are permitted the originals should be returned
together with the fair copies, and a certain percentage at least should be checked
for copying errors and other changes.

If the questionnaire is such that the investigator has to furnish or amplify
the answers to some of the questions from notes taken at the interview this
should be done immediately after the interview rather than at the end of the
day, even if this course is somewhat inconvenient. In general, however, it
is best for the information to be written down in its final form at the interview,
any supplementary observations by the investigator being given under a separate
heading.

If comparisons between the different investigators by means of
interpenetrating samples have been arranged, the comparative results must be
made available as quickly as possible, in order that effective action may be
taken if discrepancies are discovered. On the other hand the use of
interpenetrating samples should not be made an excuse for the relaxation of
other forms of control. Interpenetrating samples are not likely to reveal minor
defects in an individual investigator, and they will certainly not reveal faults
which are common to all investigators. They should therefore be regarded
as a check against major defects in individual investigators rather than as a
complete control of all investigators.


5.7 Arrangements for follow-up in the case of non-response

The follow-up arrangements will naturally vary very greatly according to
the type of census or survey. In the case of a postal questionnaire they will
require a different organization from that which is employed for the survey
itself. Since postal follow-ups from headquarters are of limited utility, some
form of local organization which can deal with non-respondents by telephone 
and personal visit is required. 

In surveys using field investigators careful instructions must be issued 
in order to be sure the follow-up arrangements are properly carried out. 
Specific warnings should be given against such practices as substitution of 
neighbouring households when there is no response. If the follow-up is to 
be made on a sub-sample only of the non-respondents exact instructions for 
taking this sub-sample must be given, so that it can be obtained as soon as 
the non-respondents are known. In general some very simple sampling method, 
such as taking of every qth non-respondent, is adequate. Such a procedure 
has the advantage that a list for follow-up can be prepared at the time of the 
original non-responses, if necessary by the field investigator concerned. 
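A minimal sketch of the "every qth non-respondent" rule is given below. The function name `every_qth`, the household labels and the random choice of starting position are assumptions made for illustration; the text itself specifies only that every qth non-respondent is taken.

```python
import random


def every_qth(non_respondents, q, start=None, seed=None):
    """Systematic sub-sample: take every q-th entry of the list, in listed order.

    `start` is the 1-based position of the first unit taken (1..q). If it is
    omitted it is drawn at random, which is an assumption of this sketch.
    """
    if q < 1:
        raise ValueError("q must be a positive integer")
    if start is None:
        start = random.Random(seed).randint(1, q)
    if not 1 <= start <= q:
        raise ValueError("start must lie between 1 and q")
    return non_respondents[start - 1::q]


# Non-respondents listed in the order in which they were recorded.
pending = ["household %d" % n for n in range(1, 11)]
follow_up = every_qth(pending, q=3, start=2)
# follow_up == ['household 2', 'household 5', 'household 8']
```

Because the rule depends only on the order of listing, the follow-up list can be drawn up as soon as each non-response is recorded.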


5.8 Statistical analysis 

Statistical analyses of the results of complete censuses and surveys are 
mostly based on counts of numbers of units falling in different classes and 
sub-classes, and on the totals for these classes of recorded quantitative variates. 
The units to which these counts and totals refer may be either the sampling 
units or some other natural units. In certain cases totals are also required for 
ratios or other quantities calculated from the values recorded for the individual 
units. 

From these numbers and totals the means can be calculated for the different 
classes. Basic summary tables can then be prepared. In these summary 
tables frequencies based on counts are often expressed in percentages, the 
bases of the percentages being chosen so as to exhibit the differences in 
proportions which are of interest. When the summary tables have been 
prepared, more critical statistical analysis of these tables may be required in 
order to isolate the effects of the various factors which are believed to influence 
the results. 

The treatment of the results of sample censuses and surveys is similar 
in most respects to that of complete censuses and surveys. If, however, the 
sampling fractions are different for the different units, the appropriate weights 
have to be applied at some stage of the calculations. The utilization of 
supplementary information will also necessitate adjustment of the basic totals 
and means. The appropriate formulae for these operations, which differ
according to the type of sampling adopted, are given in Chapter 6. 

In sample censuses and surveys we shall normally require estimates of the 
sampling errors. In addition, investigations of the relative efficiency of 
different sampling methods may be undertaken in order to improve the efficiency 
of future surveys on the same or similar material. 

The various stages in the computations may therefore be classified as 
follows :— 

(1) Preliminary computations on the values of the individual returns, 
such as the calculation of ratios and the introduction of individual weighting
factors. If punched cards are used, some or all of this work may be carried
out mechanically, subsequent to the punching of the cards. 

(2) Abstraction and coding of the results so that they are in a form suitable 
for analysis or for transfer to punched cards. 

(3) Punching (in cases where punched cards are used). 

(4) Counts and totals. 

(5) Preparation of the summary tables from these counts and totals, 
including adjustments for supplementary information and any weighting not 
already carried out. 

(6) Calculation of sampling errors and investigations of efficiency.

(7) Critical analysis. 

Apart from the calculation of sampling errors and investigations of efficiency, 
which are described in Chapters 7 and 8, we do not propose to discuss these
operations in detail in this book. In the following sections we will merely
give an outline of the special points that arise at the various stages.


5.9 Methods of handling the data 
There are four main ways in which the data accumulated in the course 
of a census or survey may be handled. These are: 


(1) An analysis direct from the forms. 

(2) Transference of the data to ordinary cards. 

(3) The use of cards with holes round the edges (Cope-Chat cards). 
(4) The use of Hollerith or Powers-Samas cards (punched cards). 


The primary function of any type of card is to enable the data to be sorted 
into different classes, so that the numbers of units and totals associated with
these classes can be obtained without transcription. With plain cards the
sorting has to be done entirely by hand, with Cope-Chat cards marginal
punching gives some aid to the hand sorting process, while with punched
cards the sorting is carried out mechanically, and the counts and totals are
also obtained mechanically.

Information can be recorded directly on cards which are
subsequently used in the analysis. These may be either ordinary cards or
Cope-Chat cards. The use of cards in this manner is limited by the fact that
the amount of uncoded information that can be conveniently recorded on a
card is small, and also by the fact that cards tend to be damaged by use in
the field.


It is also possible to record information directly on Hollerith cards, either
in a form which enables it to be read by the punch operator as the card is
punched, or in a form that enables it to be punched automatically by the process
known as mark sensing. The occasions on which either of these methods has
any real advantage over the punching of cards from ordinary forms are
somewhat rare in census and survey work.

* But see Section 10.1.


If only a single classification is required, the preparation of a summary
directly from the forms is likely to be the most economical method of procedure.
If more than one classification is required, the use of forms may still be
reasonably economical, particularly for small surveys, but the possibilities of 
using forms in this manner are limited by the fact that paper forms are not 
easily sorted or counted, and will not stand a great deal of handling.* The 
direct use of forms is also sometimes of value when a rapid preliminary summary 
of the salient features of a survey is required. In a large survey such summaries
can usually be based on a small sub-sample of the forms. 

If the data are transferred to cards, some form of compression and coding 
is usually necessary. This enables the information to be recorded in compact 
form on the card, and also facilitates the subsequent counts and summation. 
If Cope-Chat cards are used, all information recorded in punched form must 
be coded, and with punched cards the whole of the information has to be 
coded in numerical (or exceptionally alphabetical) form. 

The fact that punched cards have all their information coded in numerical 
form has the disadvantage that the detailed information relating to separate 
units cannot be easily studied by means of the cards themselves. It is also 
difficult to record written remarks on the cards. This tends to make the 
analysis more mechanical. Punched cards are therefore unsuitable for analyses 
which require detailed examination of the whole complex of information 
relating to individual units. Even in surveys which are so large that analysis 
by means of punched cards is essential it is often advisable to arrange that the 
original forms are kept available, so that in any detailed investigational work
the forms corresponding to selected cards can be extracted and examined
when required.


5.10 Cope-Chat cards 


Cope-Chat cards are cards which have a row of holes along each edge.
A group of these holes can be assigned to each particular classification, e.g. 
the answers to a specific question, each hole being taken to represent one 
class in this classification. The body of the card (front and back) can be used 
for recording written information. 

By means of a punch similar to an ordinary ticket punch, V-shaped notches 
can be cut out of the card so as to obliterate any desired holes. If the cards 
are arranged in a pack and a knitting needle is passed through a particular hole, 
the cards punched in this hole will fall from the pack when the pack is lifted 
by means of the needle and thoroughly shaken. This enables cards to be sorted 
into different classes with considerably greater speed than would be the case 
if the information were merely recorded on plain cards, and the sorting had to 
be carried out by examination of each card. The Cope-Chat method of sorting 
is not fully reliable, since cards do not always fall out of the pack when it is 
shaken, but mis-sorts can be detected by visual inspection of the edges of 
the retained cards. If all classes of a given classification are coded in some 
mutually exclusive system a positive check will be available. 


* In making counts or calculating totals from forms it is usually best to sort the
forms into the necessary classes.


The amount of information that can be recorded on the edge of the cards
is limited, since the number of holes is limited by the size of the card. In 
the Survey of Fertilizer Practice, for example, 5 in. × 8 in. cards are used
with a hole spacing of just over 4 to the inch, giving 105 holes in all.* 

The punching of Cope-Chat cards is somewhat laborious, and if a mistake 
is made a new card has to be prepared. In this case the written information 
will have to be transferred to the new card. For this reason it is usual to mark 
the holes which have to be punched, and check the markings before actually 
punching. If a considerable amount of punching is to be done, a form of gang
punch can be used which will punch a particular hole from a number of cards
at one operation. In this case the cards can be sorted into the appropriate
classes and the sorting checked before punching. A key-operated punch is
also available.

When the cards have been sorted they require to be counted by hand.
If totals of numerical information are required, the summations must be
performed on an ordinary adding or calculating machine, unless the numerical 
information has itself been coded. The counting of cards is a tedious operation, 
and is made more so by the punching round the edges. For some purposes 
it may be feasible to replace exact counts either by weighing or by measuring 
the aggregate thickness under a definite pressure. Neither method is very 
accurate, however. In a humid climate, for example, the weights tend to vary 
considerably owing to changes in moisture content.

The use of Cope-Chat cards enables isolated cards having given
characteristics to be much more readily extracted than is the case when plain
cards are used. Cope-Chat cards are therefore of value for surveys in which
units of particular types require to be identified subsequently. In this respect
they have certain advantages over punched cards, since no elaborate sorting
mechanism is required and the information concerning the selected units is
presented in written form.

Cope-Chat cards also have the minor advantage that the proportions falling
in different classes can be roughly observed by sorting the cards and then
examining the distribution of the notches.
* In card systems one corner is cut across, which provides a check that all
cards are right way round in the pack.


The coding of numerical information on Cope-Chat cards can be carried
out in a number of ways. If approximate values only are required the data
may be grouped into size-groups. If exact values are required the simplest
method is to allocate ten holes to each digit of the number, but this can only
be done if very little numerical information has to be coded, owing to the
limited capacity of the card. An alternative is to use some form of two-hole
code for each digit. The most compact is that based on four holes
which are taken to denote the digits 1, 2, 4, 7. To code other digits the two
digits whose sum gives the required digit are punched. Thus the punching of
1 and 2 denotes the digit 3. This system is not self-checking on sorts, since
two, one or no holes may be punched. If a fifth hole carrying the value 0 is
added, every digit can be denoted by a pair of holes, with the convention that
4, 7 denotes 0. An alternative with five holes, which is simpler, but not
self-checking on sorts, is to use the holes to denote the digits 1, 2, 3, 4, 5, digits
over 5 being indicated by double punching.
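The five-hole system in which every digit is denoted by a pair of holes can be written out as a small table. The sketch below builds that table from the hole values 0, 1, 2, 4, 7 and the stated convention that 4, 7 denotes 0. The function names are invented for the example.

```python
from itertools import combinations

HOLES = (0, 1, 2, 4, 7)   # values carried by the five holes


def holes_for_digit(digit):
    """Return the pair of holes punched to denote a decimal digit (0-9)."""
    if digit == 0:
        return (4, 7)                      # convention stated in the text
    for pair in combinations(HOLES, 2):
        if pair != (4, 7) and sum(pair) == digit:
            return pair
    raise ValueError("digit must be between 0 and 9")


def digit_for_holes(punched):
    """Decode a pair of punched holes; anything but two distinct holes is an error."""
    pair = tuple(sorted(punched))
    if len(pair) != 2 or pair[0] == pair[1] or not set(pair) <= set(HOLES):
        raise ValueError("exactly two distinct holes must be punched")
    return 0 if pair == (4, 7) else sum(pair)


code_table = {d: holes_for_digit(d) for d in range(10)}
# e.g. code_table[3] == (1, 2) and code_table[0] == (4, 7)
```

The ten possible pairs of five holes correspond one-to-one with the ten digits, so a card showing any number of notches other than two in a digit group is at once known to be wrongly punched. This is what makes the system self-checking.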


5.11 Punched cards 


Two different systems are available, known as the Hollerith and Powers- 
Samas. Both systems employ cards in which each column has 12 positions, 
in any one of which a hole may be punched. When required, two or more 
holes may be punched in different positions in the same column. In both 
systems alphabetical information can be dealt with by means of a two-hole
code.*

Hollerith installations employ 80- or 38-column cards. Powers installations
employ 65-, 36- or 21-column cards. A given installation will only handle
cards of one size. By using each column for two items of information, with
a special form of multiple punching, the numerical capacity of cards of both
systems can be doubled.


The actual punching of the cards is normally done by a hand-operated
key punch. Verification, which checks within certain limitations that the
original punching is correct, is normally performed by means of a
hand-operated verifier similar in construction to a punch. More elaborate punches
of various kinds are also available.

The main difference between the Hollerith and Powers systems is that in
the Hollerith system the cards are read electrically, whereas in the Powers
system they are read mechanically. This results in a greater flexibility in the
Hollerith system, since the machines can be set up for different operations
by means of electric connections through one or more plug-boards. If the
analysis is confined to sorting and counting, machines of the two systems, of
similar capacity, have almost identical performance. For the more elaborate types
of analysis, Hollerith equipment is more suitable,
particularly in surveys of moderate size where many different types of machine
operation, which often cannot be planned in advance, have to be carried out on
small batches of cards.

The principal machines are :—

(1) The sorter,

(2) The sorter-counter,

(3) The tabulator,

(4) The reproducing summary punch,

(5) The multiplying punch,

(6) The collator.


* See Section 10.1 for notes on some recent developments.


The descriptions which follow are not intended to give a complete account
of these machines and their various modifications, but only an indication of 
the way in which they work and the simpler types of operation that can be 
undertaken with them. An expert should always be consulted when planning 
any extensive punched-card work. Initial consultations should take place 
before the coding of the material is undertaken. 


5.12 The sorter and sorter-counter 


The sorter can be set to operate on any one column of the card. When 
the cards are passed through the machine they are separated into 12 boxes
corresponding to the 12 positions of the holes punched in this column, with
an additional box for cards with no hole in the column. If, therefore, a code
representing some classification of the material into anything up to 12 classes
is punched in the column, the cards corresponding to the different classes
will be sorted into the different boxes. A classification with more than 12 and
up to 144 classes can be coded on two columns, and by sorting successively
on each of the two columns separation into all the classes can be effected.

In the same way, if a group of columns or field is used to denote a number,
the cards can be arranged in numerical order by sorting first on the units, then
on the tens, and so on. Equally, if two columns represent two different
classifications the cards can be sorted into the various cells of the two-way
classification so formed. If two holes are punched in the same column the card
is sorted to the higher digit, unless sorting on this digit is suppressed.

The sorter is normally used for arranging the cards of the pack into groups
or into a given order prior to their passage through the tabulator. When this
is done the whole of the cards are kept in one pack, i.e. at the end of each sort
the cards are collected from the separate boxes and the sub-packs are placed
together in numerical order.

If only counts are required it is possible to obtain these directly on a sorter
with a counter device which registers the numbers of cards punched in the
various positions in the given column. A machine with this device is called
a sorter-counter. Sorting can be suspended during counting if desired.
The ordinary sorter-counter counts on a single column only and does not
print the results. For large-scale census work more elaborate types of
sorter-counter are available which will count simultaneously on a number of columns,
printing the results obtained in these counts.
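The rule of sorting "first on the units, then on the tens, and so on" may be illustrated by the following sketch. For simplicity it assumes cards represented as strings of digits and ten pockets (0 to 9) only. The twelve punching positions and the box for unpunched cards on an actual sorter are ignored, and the function names are invented for the example.

```python
def sort_on_column(pack, column):
    """One pass through the sorter: distribute the cards into pockets 0-9 on a
    single column, then collect the pockets in numerical order. Cards keep
    their existing order within each pocket."""
    pockets = [[] for _ in range(10)]
    for card in pack:
        pockets[int(card[column])].append(card)
    return [card for pocket in pockets for card in pocket]


def sort_on_field(pack, first_column, width):
    """Arrange the pack in numerical order of a field by sorting on the units
    column first, then the tens, and so on."""
    for column in range(first_column + width - 1, first_column - 1, -1):
        pack = sort_on_column(pack, column)
    return pack


pack = ["042", "317", "009", "120", "042", "315"]
ordered = sort_on_field(pack, first_column=0, width=3)
# ordered == ['009', '042', '042', '120', '315', '317']
```

The method works only because each pass preserves the order established by the previous passes, which is why the sub-packs must be collected from the boxes and kept together in numerical order at the end of each sort.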


5.13 The tabulator


The tabulator is a much more elaborate machine than the sorter. Its
primary function is to add the numbers punched in a given field from a group of
cards. To effect this the numbers are read successively as the cards pass through
the machine, being added on one of a set of counters which form part of the
machine. The machine has a printing device which will print the totals
accumulated in the counters, and will also, if desired, print numbers read from
the cards. The operation of obtaining and printing the totals is called
tabulation, and that of printing numbers read from the cards is called listing. 

Most tabulators have a number of counters and print banks ; the totals 
of several fields can therefore be accumulated simultaneously. If the numbers 
in the fields concerned are sufficiently small, two or more fields can be 
accumulated in different parts of the same counter, thereby further increasing 
the capacity of the machine. 

In order to enable the totals of the groups of cards in different classes to 
be obtained successively without stopping the machine and without having 
to feed in the groups of cards separately, a device known as the control is 
incorporated in the machine. This device is such that when wired to control 
on a given column, the machine will break control if the card following the 
one that is being added carries a different designation on the control column. 
This break of control stops the adding process and gives the machine certain 
instructions as to printing and clearing, e.g. it can be wired so that the total 
already obtained is printed and cleared before passing on to the next group 
of cards. Thus, if a pack of cards sorted into groups corresponding to the 
code on a single column is passed through the tabulator with the control wired 
to that column, the machine will break control at the end of each group and 
the group totals of any desired field can thereby be obtained. 

The control can be arranged to operate on a number of columns, and 
different stages of the control can be associated with the different columns. 
Different instructions can be given to the clearing and printing mechanisms 
according to which stage of the control is operating. Thus, for example, it 
is possible to obtain totals of main and sub-groups simultaneously by feeding 
the numbers from the given field into two different counters, one of which 
is cleared at the end of each sub-group and the other at the end of each main 
group. 
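The action of the control, with two stages giving sub-group and main-group totals, can be imitated as follows. This is a sketch only: the cards are represented as dictionaries, the column names `region`, `size` and `acres` are invented, and "printing" is replaced by appending to a list.

```python
def tabulate(cards, main_col, sub_col, field):
    """Accumulate `field` in two counters and record a total at each break of
    control. The pack must already be sorted on main_col and, within that,
    on sub_col."""
    lines = []
    sub_total = main_total = 0
    for i, card in enumerate(cards):
        sub_total += card[field]
        main_total += card[field]
        following = cards[i + 1] if i + 1 < len(cards) else None
        # Control breaks when the following card carries a different designation.
        main_break = following is None or following[main_col] != card[main_col]
        sub_break = main_break or following[sub_col] != card[sub_col]
        if sub_break:
            lines.append(("sub", card[main_col], card[sub_col], sub_total))
            sub_total = 0          # counter cleared at the end of each sub-group
        if main_break:
            lines.append(("main", card[main_col], None, main_total))
            main_total = 0         # counter cleared at the end of each main group
    return lines


cards = [
    {"region": 1, "size": 1, "acres": 10},
    {"region": 1, "size": 1, "acres": 15},
    {"region": 1, "size": 2, "acres": 40},
    {"region": 2, "size": 1, "acres": 5},
]
report = tabulate(cards, "region", "size", "acres")
# [('sub', 1, 1, 25), ('sub', 1, 2, 40), ('main', 1, None, 65),
#  ('sub', 2, 1, 5), ('main', 2, None, 5)]
```

A break on the main column also forces a break on the sub-group, since the first sub-group of a new main group may carry the same code as the last sub-group of the old one.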

Counts can be carried out on the tabulator, either in conjunction with a 
tabulation or independently, by what is known as the card count. This feeds 
1 into any desired counter at the passage of each card. The control and printing 
mechanisms operate as before. 

The more elaborate forms of tabulator have a number of auxiliary devices 
which considerably increase their potentialities and flexibility. The two most 
important in the British machines are the rolling feature and distributors. In 
the rolling total tabulator, numbers can be transferred or rolled from one counter 
to another, either positively or negatively, according to instructions issued by 
the control mechanism. Distributors enable numbers read from a field to be 
directed to different counters, and also enable numbers taken from one counter 
to be directed to different counters in rolling, or to different print banks.
When used in the first manner the distributors operate on instructions read 
from some other column of the card. 

A single distributor, for example, enables positive and negative numbers 
in the same field to be distributed into two counters according to their sign 
(punched in code in another column). By rolling the total of the negative
counter negatively into the positive counter at the end of the group the correct
total (or its complement if negative) is obtained. A further device replaces 
the complement by the negative total on printing. 
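The handling of positive and negative numbers by means of two counters may be sketched as below. The sketch assumes a counter of six decimal wheels on which a negative total appears as its ten's complement. The width of the counter, the form of complement and the way in which a negative result is recognized are all assumptions made for illustration, since the text does not describe the mechanism.

```python
WHEELS = 6                  # assumed number of counting wheels
MODULUS = 10 ** WHEELS


def roll_negatively(target, source):
    """Subtract `source` from `target` on a counter of fixed width by adding
    the ten's complement of `source`; a negative result shows as its complement."""
    return (target + (MODULUS - source)) % MODULUS


def printed_value(counter, is_negative):
    """Replace the complement by the negative total on printing."""
    return -(MODULUS - counter) if is_negative else counter


def signed_total(cards):
    """`cards` is a list of (amount, sign) pairs, the sign being '+' or '-'."""
    positive = negative = 0
    for amount, sign in cards:
        if sign == "-":
            negative = (negative + amount) % MODULUS   # distributor: negative counter
        else:
            positive = (positive + amount) % MODULUS   # distributor: positive counter
    result = roll_negatively(positive, negative)
    # Comparing the two counters stands in for the machine's own sign detection.
    return result, printed_value(result, negative > positive)


counter, printed = signed_total([(120, "+"), (350, "-")])
# counter == 999770 (the complement of 230), printed == -230
```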

By using four distributors it is possible to make counts of classes represented 
by a code in a single column without sorting on that column. In this case 
the card count is fed through the distributors which are controlled by the 
punching of the column in question. The wiring is so arranged that the card 
count is directed to a different counter or part of a counter for each of the 
12 possible code punchings.

The use of a tabulator in place of a sorter-counter for counting, either with
or without distributors, has the advantage that the results are obtained in 
printed form. It also enables the whole pack of cards to be kept together, 
since the different main classes are automatically separated by means of the 
control. 

It should be noted, however, that when there is multiple punching in the 
column being counted, the tabulator cannot be used for counting if only four 
distributors are available. A sorter-counter counts all holes punched in a 
column, e.g. if 4 and 8 are punched both the 4 and the 8 will be counted.

Rolling can be used to carry out simple multiplications. In the National 
Farm Survey analysis, for example, a variable sampling fraction, with values
1/20, 1/10, 1/4, 1/2, 1/1, was used for the different size-groups. Multiplication
of the size-group totals by their appropriate raising factors was effected as
follows :—

(a) For multiplication by 2 the total was rolled into itself.
(b) For multiplication by 4 operation (a) was repeated.
(c) For multiplication by 10 the total was rolled through a distributor so
as to shift its position by one place.
(d) For multiplication by 20 operations (a) and (c) were combined.
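Operations (a) to (d) amount to building each raising factor from doublings and a shift of one decimal place. The following sketch imitates this. The size-group totals are invented figures, and the function names are hypothetical.

```python
def roll_into_itself(total):
    """Operation (a): adding the total to itself doubles it."""
    return total + total


def shift_one_place(total):
    """Operation (c): every digit is moved one place to the left."""
    return int(str(total) + "0")


def raise_total(total, factor):
    """Multiply a non-negative total by a raising factor of 1, 2, 4, 10 or 20."""
    if factor == 1:
        return total
    if factor == 2:
        return roll_into_itself(total)
    if factor == 4:
        return roll_into_itself(roll_into_itself(total))   # (a) repeated
    if factor == 10:
        return shift_one_place(total)
    if factor == 20:
        return shift_one_place(roll_into_itself(total))    # (a) and (c) combined
    raise ValueError("no rolling scheme defined for this factor")


# Raising factor (reciprocal of the sampling fraction) -> sample total.
# The totals are invented figures.
sample_totals = {20: 37, 10: 52, 4: 81, 2: 64, 1: 29}
estimate = sum(raise_total(t, f) for f, t in sample_totals.items())
# estimate == 1741
```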
Sums of squares and products, which are required for the estimation of
sampling errors, regression coefficients, etc., can be obtained on a tabulator
by what is known as progressive digiting. Suppose that the sum of the products of
two sets of numbers A and B is required, and that the A are single-digit
numbers. The cards are sorted on A and then fed through the tabulator,
9’s first, the B’s being summed and the progressive total printed (without
clearing) at each change of digit of A. The sum of these progressive totals
(excluding the 0 total) gives the sum of the products of A and B.
This summation can be carried out on the tabulator, care being taken
by the insertion of extra cards to see that every digit is represented.
If the A contain more than one digit each digit is taken
separately, with multiplication by 10, etc., before the results are combined.
The whole of this operation can be executed in full on the larger rolling total
tabulators. Subject to the capacity of the machine, sums of products of A with
itself and several other numbers B, C, ... can be carried out simultaneously
without additional sorting or tabulation.

The American tabulators do not have the rolling feature, but are capable
of carrying out direct addition or subtraction according to card designation. 
In contrast to the rolling total tabulator, the counting wheels may be grouped 
in counters entirely at will, which enables the counter capacity to be used 
more efficiently. These tabulators, however, have no distributors, which 
detracts somewhat from their usefulness in the analysis of survey data. 


5.14 The reproducing summary punch 


The reproducing summary punch has two main functions. One is 
reproducing information from a pack of cards on to the corresponding cards 
of another pack. The second is what is called gang-punching, that is, the punching 
of information, read from the first or master card of the group, on the whole 
of a group of cards. The punch can also be used in association with a tabulator 
to punch on to new cards the results obtained in the course of a tabulation. 

In reproduction the information in any set of columns can be transferred to 
the same or any other set of columns. In gang-punching the information will 
normally be punched in the position in which it appears on the first card of the 
group. If selectors are fitted to the reproducing punch, transference to other 
columns (offset gang-punching) is possible for a number of columns not exceed- 
ing the total number of points available on the selectors. 

The reproduction of information from one pack on to another is of value 
in survey work in a number of ways. In addition to the obvious function of 
making a new pack of cards when an old one has become worn, it can be used 
to bring together on to a single card items of information referring to the 
same unit and recorded on two or more separate cards, so that the association 
between these items can be analysed. It also provides a satisfactory means of 
entering new information on to cards that have already been punched. Instead 
of punching the new information on the old cards directly, this information is 
punched on to a new pack, together with the code numbers of the units, and 
the information from the new pack is then transferred to vacant columns on 
the old pack or vice versa by means of the reproducing punch. 

When transferring information from one pack to another pack which itself 
already carries information, the two packs must of course be sorted into the 
same order. The machine, however, checks that there is correct correspondence 
between each pair of cards, and also checks that all transferred information is 
correctly punched. 


Gang-punching can be used to save hand punching when a batch of cards 
which are to be punched all carry the same code in a number of columns. It 
can also be used to transfer information from the main card on to secondary 
cards or trailers referring to the same unit. In this case the main cards act 
as master cards. In gang-punching with interspersed master cards, the master 
cards must carry an X in one of the columns in which none of the remainder 
of the cards carry an X. If such an X is not already punched it can be 
gang-punched. 

A further use in survey work is the calculation of percentages, index 
numbers, etc. A simple example will illustrate the procedure. Suppose it is 
required to express the numbers B as a percentage of the numbers A, both 
sets of numbers being punched on the cards. By suitable sorts we can assemble 
the cards in batches such that the percentage value for all the cards of each 
batch is the same. Cards for which A has the value 45, for example, and a 
value of B between 0 and 2 will have a percentage value of 0 (to the nearest 
10 per cent.), those with B between 3 and 6 will have a percentage value of 
10, etc. Master cards are therefore made out carrying the values 


     A      B      Percentage 
    45      0           0 
    45      3          10 
    45      7          20 


etc., with similar sets for other values of A, and these are added to the pack, 
which is then sorted into numerical order of the A’s, and of the B’s within 
A’s. All the cards with A equal to 45, for example, will now be together, those 
with B between 0 and 2 being preceded by the first of the above master cards, 
those with B between 3 and 6 by the second, etc. If the whole pack is then 
passed through the reproducing punch the correct values of the percentages 
will be gang-punched into the remaining cards from the master cards. 
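
The procedure can be sketched in Python as follows. This is a minimal illustration, not a description of any machine: the dictionary keys and function names are hypothetical, A is assumed to be positive, and exact halves are rounded upwards (the text does not say how ties are treated).

```python
def nearest_ten_percent(a, b):
    """100*b/a rounded to the nearest 10 per cent, by integer arithmetic."""
    return 10 * ((20 * b + a) // (2 * a))


def master_cards(a, max_b):
    """One master card for each value of B at which the rounded percentage changes."""
    cards, last = [], None
    for b in range(max_b + 1):
        pct = nearest_ten_percent(a, b)
        if pct != last:
            cards.append({"A": a, "B": b, "pct": pct, "master": True})
            last = pct
    return cards


def gang_punch_percentages(detail_cards):
    """Intersperse master cards, sort on A and B, and copy each percentage forward."""
    pack = [dict(card, master=False) for card in detail_cards]
    for a in {card["A"] for card in detail_cards}:
        max_b = max(card["B"] for card in detail_cards if card["A"] == a)
        pack.extend(master_cards(a, max_b))
    # Within equal A and B the master card must precede the detail cards.
    pack.sort(key=lambda card: (card["A"], card["B"], not card["master"]))
    current, punched = None, []
    for card in pack:
        if card["master"]:
            current = card["pct"]      # value read from the master card
        else:
            punched.append({"A": card["A"], "B": card["B"], "pct": current})
    return punched
```

With A = 45 the function `master_cards(45, 7)` yields master cards at B = 0, 3 and 7, carrying the percentages 0, 10 and 20, in agreement with the table above.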

The disadvantage of this procedure is that a large number of master cards 
are required to cover with any high degree of accuracy fields which are at all 
extensive. Time and expense are therefore involved in the preparation of the 
cards, and they also add to the total volume of sorting required. The method 
is therefore most suitable when ratios and indices of low accuracy are required 
for large batches of data. In the analysis of the National Farm Survey it was 
used for the calculation of the percentage of the area of individual farms 
which was arable (12 classes), rent per acre (12 classes), etc., and for the 
combination of several items of qualitative information into a single index. 
A description of the procedure, and a method of preparing master cards by 
use of the gang punch, is given by Kempthorne (1946, B). 


5.15 The multiplying punch 


The multiplying punch is designed to multiply together two numbers punched 
on the same card and to punch the resulting product. It can also be set to 
read the multiplier from interspersed master cards, so that the numbers on the 
whole of a group of cards are multiplied by the same factor. Cross-footing 
multiplying punches will also add or subtract two or three numbers read from 
the same card and punch the result, or add one or two numbers to the product 
of two other numbers. 

The chief use of the multiplying punch in survey work is in the calculation 
of products of various kinds prior to tabulation. It can be used for the 
calculation of ratios if the reciprocals of the divisors are punched on master 
cards and the cards are then sorted according to the divisors. It is, however, 
relatively slow in operation. Consequently when a large amount of material 
has to be handled and high accuracy is not required in the products or ratios, 
the use of gang-punching with interspersed master cards is generally to be 
preferred. 


5.16 The collator 


The collator will take two packs previously sorted in numerical order on 
up to 16 columns and will combine them into a single pack, so arranged that 
all cards of pack B which carry a given code-number follow immediately on 
the cards of pack A carrying the same code-number. It will also select matching 
cards from pack A and pack B, rejecting cards of pack A with code-numbers 
which do not occur in pack B and vice versa. Furthermore it can be used to 
select cards from one pack which correspond in code designation to the cards 
contained in a second pack. This last property is sometimes of value in survey 
work, since it can be used to pick out trailers associated with a given set of main 
cards. If it is so used, however, the precaution should be taken of matching 
up the rest of the trailers with the remaining main cards so as to provide a 
check that no trailers have been erroneously excluded. 
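
The merging and matching functions of the collator correspond to simple operations on two sorted lists. The Python sketch below is illustrative only: the card representation (a dictionary with a `code` entry) and the function names are assumptions, and `match` uses sets of code numbers rather than the card-by-card comparison a machine would make.

```python
def collate(pack_a, pack_b, key=lambda card: card["code"]):
    """Merge two packs already sorted on the code number.

    Cards of pack B follow the cards of pack A carrying the same code number.
    """
    merged, i, j = [], 0, 0
    while i < len(pack_a) or j < len(pack_b):
        take_a = j == len(pack_b) or (
            i < len(pack_a) and key(pack_a[i]) <= key(pack_b[j])
        )
        if take_a:
            merged.append(pack_a[i])
            i += 1
        else:
            merged.append(pack_b[j])
            j += 1
    return merged


def match(pack_a, pack_b, key=lambda card: card["code"]):
    """Split each pack into cards whose code occurs in both packs and the rest."""
    common = {key(c) for c in pack_a} & {key(c) for c in pack_b}
    matched_a = [c for c in pack_a if key(c) in common]
    matched_b = [c for c in pack_b if key(c) in common]
    rejected_a = [c for c in pack_a if key(c) not in common]
    rejected_b = [c for c in pack_b if key(c) not in common]
    return matched_a, matched_b, rejected_a, rejected_b
```

The precaution recommended above amounts to a second call of `match`, on the unselected trailers and the remaining main cards, followed by a check that no trailer is rejected.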


5.17 Systems of coding for punched cards 


When punched cards are used all non-numerical information will require 
to be coded in some quasi-numerical form. Each column of the card can 
represent a classification of up to twelve classes, which are denoted by X, Y, 
0, 1, . . ., 9. 

Numerical quantities do not in general require to be coded, but may 
require rounding-off in order to economize card space and reduce the counter 
capacity needed in the subsequent tabulations. Rounding-off, however, is a 
tiresome operation and reduction of a number by a single digit should only 
be undertaken if really necessary. Often rounding-off can be avoided by 
issuing suitable instructions to the field investigators regarding the number of 
figures required in the results. 

In certain cases it may pay to code a numerical quantity by grouping. This 
is particularly advantageous when the quantity is primarily required as a basis 
of classification and does not need to be summed. If summation is needed, 
large values must not be too coarsely grouped, and there must be no “ open” 
group : it is not sufficient for all the high values to be included in an “ over — ” 
class. This often necessitates an additional column or over-punching in the 
X and Y positions. If there is to be summation the grouping must also be 
chosen so as to avoid the bias which can arise through the frequent occurrence 
of particular values (see Example 7.2.b), though such bias is relatively 
unimportant when only comparative results are required. 

The construction of a code for complicated items of information, e.g. 
questions with a large number of possible alternative answers, is often a difficult 
task, since the conflicting aims of recording the information adequately, 
simplifying the actual task of coding, and keeping the code compact have to be 
reconciled. In many cases the most suitable form of coding for a given item 
of information can only be devised by examination of a sample of the 
questionnaires or returns. 

Compactness of coding is important not only in saving card space, but also 
in simplifying the tasks of sorting and grouping the cards into classes. With 
a two-column code, other than the type described in the next paragraph, both 
sorting and counting are more complicated than in a one-column code. 
Furthermore, classes which have been separately coded cannot be grouped for 
purposes of tabulation unless a group code is gang-punched or control is 
omitted from the code columns, since the control will automatically break at 
each change of code. 

For this reason it often pays to code an item of information in two parts, 
main and subsidiary. Thus in an agricultural survey of England and Wales 
the place location will be given by one of 61 counties or part counties. 
Instead of numbering the counties consecutively the provinces can be numbered, 
and the counties numbered consecutively within provinces. This will facilitate 
sorting and tabulation of the material by provinces, and the code will still only 
require two columns. 

Provision should always be made in a coding scheme for recording lack 
of information on items for which this contingency is likely to arise. Leaving 
the column in question blank is not satisfactory, as every occupied column 
should be punched on all cards. For the same reason numbers of under 100 
in a three-column field, for example, should commence with 0, not a 
blank. 

If there are a large number of questions to which only two or three 
alternative answers are possible the columns required on the punched card 
can be reduced by combining the answers to two or more questions in a single 
code. Thus two questions, each with the alternative answers yes, no, don’t 
know, can be coded in one column by using the 9 combinations of the answers. 
Such coding, however, is decidedly more troublesome and liable to error, 
and is also less convenient for subsequent analysis. In large-scale surveys, 
therefore, it is often better to keep such questions separate even if this means 
using an additional card. 
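
A combined code of this kind, and the inconvenience it causes in analysis, can be illustrated by a short Python sketch. The numbering of the nine combinations from 1 to 9 and the function names are assumptions made for the illustration; the text does not prescribe any particular assignment.

```python
ANSWERS = ("yes", "no", "don't know")


def combine(first, second):
    """Code the answers to two questions as a single digit from 1 to 9."""
    return 3 * ANSWERS.index(first) + ANSWERS.index(second) + 1


def split(code):
    """Recover the two separate answers from the combined code."""
    if not 1 <= code <= 9:
        raise ValueError("combined code must lie between 1 and 9")
    first, second = divmod(code - 1, 3)
    return ANSWERS[first], ANSWERS[second]


def codes_with_first_answer(answer):
    """The group of codes that must be brought together to count one answer."""
    return [combine(answer, second) for second in ANSWERS]
```

A count of "yes" answers to the first question alone now requires three codes (here 1, 2 and 3) to be grouped, whereas with separate columns a single punching position would suffice.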


In questions in which the answers are not mutually exclusive, multiple 
answers are usually recorded by multiple punching, i.e. punching in a single 
column the holes corresponding to all the answers given. 

Multiple punching can also be used to economize card space in other ways. 
A two-hole code, for example, is used for alphabetical information. Different 
classifications, if they contain sufficiently few classes, can be punched on different 
parts of the same column. This, however, should only be done if it is reasonably 
certain that only counts of these classifications will be required, and that they 
will not be needed for the control of tabulations. 

In cases in which the majority of numbers in a field are below, say, 1000, 
but there are a few numbers which are between 1000 and 3000, numbers 
between 1000 and 1999 can be denoted by overpunching X in the first column 
of the field and numbers between 2000 and 2999 by overpunching Y. Certain 
tabulators are fitted with a device (the 29 feature) which enables numbers 
so punched to be added directly in the course of the tabulation. If such a 
device is not available the X’s and Y’s will have to be separately counted and 
adjustments made. 
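
The adjustment required when no such device is fitted can be sketched as follows. The representation of a card field as an overpunch mark plus three digits, and the function names, are assumptions made purely for illustration.

```python
def punch(number):
    """Represent 0-2999 in a three-column field with an optional X or Y overpunch."""
    if not 0 <= number <= 2999:
        raise ValueError("number outside the range the overpunch code can carry")
    overpunch = (None, "X", "Y")[number // 1000]
    return overpunch, f"{number % 1000:03d}"


def adjusted_total(fields):
    """Total of the punched digits, corrected by separate counts of X's and Y's."""
    digit_total = sum(int(digits) for _, digits in fields)
    x_count = sum(1 for overpunch, _ in fields if overpunch == "X")
    y_count = sum(1 for overpunch, _ in fields if overpunch == "Y")
    return digit_total + 1000 * x_count + 2000 * y_count
```

Here `punch(1571)` gives `("X", "571")`, and the total of a batch is the plain total of the three-digit fields plus 1000 for every X and 2000 for every Y counted.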

Multiple punching should not be used excessively. It slows up the 
punching, is difficult to verify, and introduces complications into the sorting 
and tabulations. If the data cannot conveniently be coded in some simple 
form that will go on a single card with little multiple punching, it is usually 
better to use an additional card, recombining the data as required by means 
of a reproducing punch. 

A way of avoiding multiple punching when dealing with occasional numbers 
which exceed the allotted capacity of the field concerned is the use of trailers. 
Thus a number 2571 can be recorded in a three-column field by the use of two 
trailers, the numbers 999, 999 and 573 being punched. Apart from these 
numbers only the code number need be hand-punched on the trailers, the 
remainder of the information being gang-punched from the main card. 
Trailers must be distinguished from main cards. If this is done by punching 
0 and 1 respectively on some column the 1 can be used for counts ; this, 
however, requires an additional column. If X and Y are used in some occupied 
column an additional distributor will be required for counts. Alternatively 
the trailers can be removed when counting. 
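
The splitting of an over-large number between a main card and its trailers can be expressed as a short routine. This is a sketch under the assumption, taken from the example above, that each card is filled to the capacity of the field (999 for three columns) before the next is started; the function name is hypothetical.

```python
def split_over_trailers(value, columns=3):
    """Parts to be punched: the first on the main card, the rest on trailers."""
    if value < 0:
        raise ValueError("value must be non-negative")
    capacity = 10 ** columns - 1      # 999 for a three-column field
    parts = []
    while value > capacity:
        parts.append(capacity)
        value -= capacity
    parts.append(value)
    return parts
```

For the number 2571 the routine returns `[999, 999, 573]`, i.e. a main card and two trailers whose fields sum to the original value when tabulated.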

Simple qualitative information can often be pre-coded on the questionnaire 
form. Thus a question to which the only alternative answers are yes, no, 
don’t know, can have these answers printed on the form in conjunction with 
the numbers 1, 2, 3. The investigator is then instructed to ring the appropriate 
code number. If it is necessary to make provision for possible non-standard 
answers partial pre-coding can be used, a line being left for alternative answers 
which are subsequently coded in the office. 

Pre-coding has the advantage that the amount of office work is considerably 
reduced, since the forms can be sent for punching after scrutiny without further 
work. It must, however, be confined to questions to which the alternative 
answers can be printed on the form. It is unsuitable in the case of questions 
to which complicated and involved answers are likely to be received. If pre- 
coding is used in such cases there is a danger that the recorded answers will 
be excessively stereotyped. 


5.18 Arrangement of information on punched cards 


No serious problems of card arrangement arise when the sampling units 
are the natural units of the population and the whole of the information on a 
unit can be coded on one card. The order in which the items are arranged 
on the card is immaterial in the Hollerith system. Consequently the order 
which is most convenient for punching may be adopted. Blank columns 
between columns which are punched should be avoided by arranging all the 
blank columns at one end of the card. 

It is generally advisable to leave a few blank columns in order to accommodate 
gang-punching of such items as index numbers and grouped classifications 
required in the course of the analysis. The number of blank columns likely 
to be required depends very greatly on the type of work. They may not be 
necessary at all in simple censuses for which the exact form of analysis can 
be prescribed at the outset. If in the course of the analysis it is found that 
additional columns are necessary, these can be made available by reproducing 
the cards, with the omission of items of information which are no longer 
required (or are not required in association with the additional columns). 

If more than one card is needed to accommodate all the relevant information 
on a unit, it is necessary that items of information that require to be associated 
in the analysis should ultimately appear on the same card. Apart from the 
code number each item of information need only be punched on one card, 
the requisite items being transferred to other cards of the set by the reproducing 
punch. In certain cases entirely new cards may require to be constituted in 
this way, but in others sufficient blank columns may be left on the cards 
originally punched to accommodate the additional items. Convenience of 
punching should still be one of the prime considerations in the arrangement 
of the original cards. It is generally better to make an additional set of cards 
after punching than to separate information which falls in a natural punching 
sequence. 

If the units which will form the basis of the analysis are not the sampling 
units, or if there is a hierarchy of units, the card arrangement presents more 
difficult problems. This situation is of fairly frequent occurrence. As an 
example we may consider suitable card arrangements for surveys of human 
populations in which the household is the sampling unit, and in which both 
households and individuals require to be treated as units in different parts of 
the analysis. 

If the survey is a simple one, in which the whole of the information relating 
to the household can be accommodated in say 20 columns, and the whole of 
the information relating to each individual in say 10 columns, it will be possible 
to accommodate all the information relating to a household of up to five 
individuals on one card, leaving 10 columns blank for subsequent use. With 
this arrangement of the card, households of 6 to 10 individuals will require 
one trailer, 11 to 15 two trailers, etc. 
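
The arithmetic of this layout can be checked with a small sketch. The card width of 80 columns is an assumption implied by the figures quoted (20 + 5 × 10 + 10), and it is further assumed that each trailer, like the main card, holds five individuals; the names are hypothetical.

```python
CARD_COLUMNS = 80          # assumed width: 20 + 5 * 10 + 10 columns
HOUSEHOLD_COLUMNS = 20
INDIVIDUAL_COLUMNS = 10
BLANK_COLUMNS = 10


def individuals_per_card():
    """Number of individuals that fit on one card of the layout described."""
    free = CARD_COLUMNS - HOUSEHOLD_COLUMNS - BLANK_COLUMNS
    return free // INDIVIDUAL_COLUMNS


def trailers_needed(household_size):
    """Trailers required in addition to the main card."""
    if household_size < 1:
        raise ValueError("a household must contain at least one individual")
    per_card = individuals_per_card()
    return -(-household_size // per_card) - 1   # ceiling division, less the main card
```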


This arrangement has the disadvantage that tabulations relating to individuals 
require five separate passages of the cards through the tabulator, with sorts 
between each passage, since individuals may appear in every one of five divisions 
of the card. This does not necessitate the passage of a larger total number 
of cards through the tabulator than if each individual were represented on a 
separate card, but it does result in the summaries being produced in five 
separate parts. These will then have to be combined by hand, or by punching 
and tabulating summary cards. The total time of tabulation will also be 
somewhat increased owing to the increased number of printing and clearing 
cycles. 

The alternative is to have a separate card for each individual. This will 
in any case be necessary if the amount of information relating to individuals 
is such that more than 10 to 15 columns are required per individual. It is usually 
necessary to have at least some of the information relating to the household 
reproduced on each individual card. If the household information requires 
say 40 columns and the individual information 30 columns, the household and 
the first individual in it can be punched on the first card. For the subsequent 
individuals only the code number of the household need be punched, the 
remainder of the household information being gang-punched subsequently. 
For convenience of punching, the code number should be so placed on the 
card that it is contiguous to the individual information. Each individual 
should also be allotted a serial number within the household in some orderly 
sequence, e.g. head of the household, wife, children by age, and other members 
by age. If still more space is required for the household information a separate 
card or cards will have to be given over to the household, with a selection of 
this information gang-punched on the cards for individuals. 

The use of a separate card for each individual has one serious drawback. 
Although the whole of the information relating to a household and to all the 
individuals in it is recorded on the cards, it is impossible to classify households 
according to the collective characteristics of the individuals contained in them. 
We can of course pick out households containing one or more individuals 
having a given characteristic, e.g. we can select all households containing babies 
of under a year old by selecting the cards representing such babies. But we 
cannot, for example, classify the households according to number and age of 
children, unless this information has already been coded and recorded in 
summary form on the household part of the card. 

In order to enable households having given collective characteristics to be 
picked out on the sorter, it is necessary for the whole of the relevant information 
concerning all the individuals in the household to be recorded on a single 
card. If, therefore, individuals in the same household are spread over more 
than one card we must construct a new set of household cards containing the 
relevant particulars of all individuals in the household. Some upper limit 
must be imposed on size of household, households of above this size being 
dealt with by hand where necessary. Thus with a 5-column field for household 
code number, and the reservation of 15 columns for subsequent recording of 
new classifications, it is possible to allot 5 columns to each of 12 individuals. 
The construction of such cards can be effected by reproducing on to the first 
set of 5 columns of the new cards the relevant columns of all individuals 
numbered 1, together with the household code numbers, followed by all 
individuals numbered 2 and so on. When the new household cards have 
been constructed they can be classified and machine coded according 
to their characteristics by sorting and gang-punching, using master cards. 
Finally the new codes can be transferred to new household cards, together with 
the required household information. 
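
The construction of household cards from individual cards can be sketched as follows. Everything in the card representation is an assumption made for illustration: the dictionary keys, the function names, and in particular the hypothetical layout of the 5-column field (sex in the first column, age in the second and third).

```python
from collections import defaultdict

MAX_INDIVIDUALS = 12   # 5 columns each, as in the layout described above
FIELD_WIDTH = 5


def build_household_cards(individual_cards):
    """Assemble one card per household from the cards for individuals.

    Returns (cards, by_hand): cards maps each household code number to a list
    of 12 fields (None where blank); by_hand holds the code numbers of
    households above the size limit, which are left for hand treatment.
    """
    by_hand = {c["household"] for c in individual_cards
               if c["serial"] > MAX_INDIVIDUALS}
    cards = defaultdict(lambda: [None] * MAX_INDIVIDUALS)
    # One reproducing pass per serial number: individuals numbered 1, then 2, ...
    for serial in range(1, MAX_INDIVIDUALS + 1):
        for c in individual_cards:
            if c["serial"] == serial and c["household"] not in by_hand:
                if len(c["particulars"]) != FIELD_WIDTH:
                    raise ValueError("particulars must fill exactly 5 columns")
                cards[c["household"]][serial - 1] = c["particulars"]
    return dict(cards), by_hand


def number_of_children(fields, age_limit=15):
    """A collective characteristic: persons under age_limit on one household card."""
    return sum(1 for f in fields if f is not None and int(f[1:3]) < age_limit)
```

Once every household is on a single card, a collective classification such as `number_of_children` can be obtained from that card alone, which is the purpose of the rearrangement.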

This process is not necessarily simpler than the alternative of coding by 
hand from the original forms. In a small survey hand-coding will probably 
be more economical, particularly if the collective characteristics that require 
to be coded are known at the outset. Each case must be judged on its merits 
as it arises. 

A further alternative, if the data on the original forms is not easily accessible, 
is to use the tabulator to list the relevant particulars of all individuals family 
by family, the coding being carried out by hand from this list. 


5.19 General remarks on the planning of the computations 


The preceding sections indicate the potentialities and limitations of the 
various methods of handling survey material. The methods of computation 
which will be most appropriate in given cases depend not only on the type 
of material, on the scale of the survey, and on the analysis required, but also 
on the equipment and personnel available. When punched card equipment 
is not readily accessible, for example, it may be better to use alternative methods, 
even though punched card equipment would otherwise be appropriate. 

The extent to which it is advisable to carry out preliminary computations 
of index numbers, etc., on punched card equipment is also a matter which 
depends on the relative availability and cost of ordinary computing labour 
and punched card equipment. By suitable preliminary computations it is 
often possible materially to reduce the amount of work that has to be carried 
out on punched card machines. Thus, if only totals or means of multiple 
measurements taken on the same unit are required in the analysis it will clearly 
be better to obtain these before the cards are punched, punching only the 
results on the cards. On the other hand, such items as age at marriage can 
frequently be conveniently obtained from date of marriage and date of birth 
by gang-punching with interspersed master cards. The question of the extent 
to which computations of this kind can be economically mechanized and carried 
out on punched card machines in any given circumstances is a very technical 
matter, and the advice of an expert should always be sought before final decisions 
are taken. 


In certain types of survey a good deal of preliminary computation is required 
before the data can be abstracted and coded. Thus in the Survey 
of Fertilizer Practice (Section 4.23 and Example 6.19), which has been analysed 
on Cope-Chat cards, the applications of fertilizer are usually stated by the 
farmers in the form of so many hundredweights per acre of a given compound. 
From the dressing per acre and the chemical composition of the fertilizer 
it is necessary to work out the dressings per acre and the total dressings of the 
different plant nutrients, nitrogen, phosphate and potash. In the earlier surveys 
the weighting factors, which depended on the number of the fields and on 
the overall sampling fractions, were also introduced at this stage, both the 
acreages and the total dressings on these acreages being raised by these factors. 
In the later surveys a method of grouping the data according to the size of the 
weighting factor was adopted. 
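
The preliminary computation for a single field can be sketched as below. The function name, the argument names and the composition figures used in the usage note are hypothetical; only the arithmetic (dressing per acre times percentage composition, then raising by the weighting factor) follows the description above.

```python
def nutrient_dressings(cwt_per_acre, acres, composition_pct, weight=1.0):
    """Nutrient dressings per acre, raised acreage, and raised total dressings.

    composition_pct maps each plant nutrient to its percentage in the compound;
    weight is the weighting (raising) factor for the field.
    """
    per_acre = {nutrient: cwt_per_acre * pct / 100
                for nutrient, pct in composition_pct.items()}
    raised_acres = acres * weight
    totals = {nutrient: rate * raised_acres for nutrient, rate in per_acre.items()}
    return per_acre, raised_acres, totals
```

For instance, 4 cwt per acre of a compound assumed to contain 7 per cent nitrogen, applied to a 10-acre field with a weighting factor of 2, gives 0.28 cwt of nitrogen per acre, a raised acreage of 20 and a raised total dressing of 5.6 cwt.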

There is always a danger of over-mechanization in the analysis of survey 
material. The use of punched cards in particular can easily lead to a stereotyped 
and uncritical form of analysis. If surveys of the investigational type are 
analysed by punched card methods provision should always be made for further 
tabulations, etc., the need for which may become apparent when the results 
of the first tabulations have been examined in detail. 

The degree to which an analysis can be planned at the outset depends 
very much on the type of survey. In a simple census type of survey the 
categories of information required may be determined at the outset by 
administrative requirements, and in this case the whole of the analysis can 
often be planned in advance. In surveys of the investigational type, however, 
it is only when the results have been examined and subjected to preliminary 
analysis that the most appropriate form for the final analysis will become 
apparent. 

Even in surveys of the census type the information collected often forms 
a suitable basis for more detailed and critical statistical investigations. The 
possibility of such investigations should be considered at the outset when 
planning the coding of the material, so that the information will be available 
in accessible form if required. Items of information should not in general 
be omitted from the coding merely on the ground that they are not required 
for the primary analysis. The general aim should be to summarize the whole 
of the relevant information in coded form, so that should new needs arise or 
should the primary analysis indicate that further analysis is likely to be of 
value, the work can be undertaken without re-coding or the preparation of 
new cards. 


5.20 Control of numerical accuracy in the analysis 


The attainment of a high standard of accuracy in computational work is 
extremely difficult, and demands most careful organization of the checking 
procedures and scrupulous attention to detail at all stages. Moreover a reliable 
checking system can only be devised by careful study of the types of error 
which are likely to remain undetected. 

Numerical work may be checked by repetition, by cross checks, or by 
using different methods of computation to arrive at the same results. 
Occasionally a comprehensive check which completely checks a large piece of 
computation is available, e.g. the solution of a set of linear equations must 
satisfy these equations. Reliable comprehensive checks and checks based on 
different methods of computation, however, are unfortunately rarely available 
in census and survey work. 

Checking by repetition may consist of working over the same computation 
a second time, or carrying out the computations in duplicate. The original 
and check computations may be carried out by the same or by different 
computers. 

In a computation which is checked by working over the figures a second 
time the main causes of error may be classified as follows: 

(1) Failure to check a value which is in error. (This includes partial failure, 
e.g. recalculation of the numerical value without checking the position of the 
decimal point.) 

(2) An identical error in both the original and check computation. 

(3) Different errors in the original and check computation which produce 
the same error in the next written or examined figure. 

(4) Failure to notice disagreement between the original and check 
computations when the original is in error. 

(5) Alteration of the original to agree with the check when the check is 
in error. 

(6) Failure to carry forward correctly corrections necessitated by a 
detected error. 

(7) Incorrect procedure in the original which is followed by the checker. 

The danger of identical errors is obviously considerably greater when the 
same computer carries out both computations. Indeed at first sight it might 
appear that if a reasonably high standard of computing is attained, the chance 
of two computers making an identical error would be somewhat remote. There 
are, however, certain errors which are particularly common, such as, for 
example, the mis-reading of a badly written figure, incorrect location of the 
decimal point, the reversal of a pair of figures, e.g. 49,876 for 48,976, and 
duplication of the wrong figure, e.g. 74,496 for 74,996. 

Duplicate computations, if properly carried out, i.e. not compared too 
frequently and corrected independently if an error is detected, will very greatly 
reduce the chances of most of the above types of error. Indeed a properly 
conducted set of duplicate computations done by different computers of 
reasonably high standard may be regarded as sufficiently accurate for almost 
all census and survey calculations. The checking of a single set of computations, 
however, even by different computers, cannot be expected to eliminate all 
errors, and if no other checks exist must be looked on as an unsatisfactory 
procedure for the more important computations. 

Fortunately in many types of computation we do not have to rely solely 
on checks by repetition. A great deal of census and survey analysis is subject 
to cross checks of various kinds. Thus the counts relating to each of a number 
of classifications must agree with the total count. The same is true 
of totals of quantitative measurements. Indeed, where cross checks of this 
kind are not available it is often best to check a set of totals by calculating 
the grand total rather than by checking every individual total. The use of a 
grand total requires only a single comparison, which can consequently be made 
with some care. If all the individual totals have to be checked, there is serious 
danger that some discrepancy may be missed. 


The use of cross checks in place of detail checks has a further advantage 
which is not so immediately apparent. If such a check fails to agree a good 
deal of re-computation is usually required to locate the error. This 
automatically tends to raise the standard of computation. 

It is important to recognize which types of error will be detected by cross 
checks and which will not be so detected. If a total check is relied on, for 
example, all entries in a table are checked, but their locations in the table are 
not checked. Thus it is possible for quantities to be entered under the wrong 
headings. Such errors can be minimized by observing a standard order in all 
tables and always entering the values in the standard order. 
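
The distinction between what a total check does and does not detect can be shown with a small sketch. The table headings and figures are invented for the illustration, and the function name is hypothetical.

```python
def agrees_with_grand_total(table, grand_total):
    """A total check: the entries of the table must sum to the known grand total."""
    return sum(table.values()) == grand_total


correct = {"arable": 120, "grass": 80, "rough grazing": 15}
grand_total = 215

# A wrong value (reversal of a pair of figures) is caught by the total check.
misread = {"arable": 102, "grass": 80, "rough grazing": 15}

# Values entered under the wrong headings pass the check unnoticed.
misplaced = {"arable": 80, "grass": 120, "rough grazing": 15}

results = (
    agrees_with_grand_total(correct, grand_total),
    agrees_with_grand_total(misread, grand_total),
    agrees_with_grand_total(misplaced, grand_total),
)
```

The misplaced table satisfies the check although two of its entries are wrong, which is why a standard order of entry is needed in addition to the total check.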

One of the main functions of any checking system is to preserve a high 
standard in the computations. Very rigorous standards of work should be 
imposed. If more than a very few errors are found to exist a complete 
re-computation should be made. No erasures or fair copies must be permitted, 
and thorough inspection of all alterations must be made to see that errors 
are properly rectified. In large-scale routine work a record should be kept of 
the errors made by the different members of the staff. The supervisor must 
be ready at all times to resolve difficulties of procedure, otherwise the computers 
will undoubtedly attempt to resolve such difficulties amongst themselves, 
possibly incorrectly. A high standard of neatness must be insisted on. All 
figures must be legible and unambiguous not only to the writer but to others. 
This is particularly important in coding. Confusion between 6 and 0, and 
between X and Y, gives rise to many errors. 

The coding and punching of the data of a large-scale census or survey 
presents its own organizational and checking problems. Even if a good deal 
of the information has to be coded it is advisable to record the coding on the 
questionnaire forms if possible, since transcription of pre-coded and numerical 
information is thereby avoided. If this is not possible a coding sheet may 
be used. This consists of an auxiliary printed form on which the information 
is entered in code, the form being so arranged that it is both convenient for 
the punch operators and for use in minor hand-analyses, if these are found to 
be required. 

In certain cases the field investigators can be asked to code their own 
material at the end of each day’s work. In general, however, this is not 
likely to be satisfactory, as it is difficult to preserve consistent standards of 
coding. 

If the coding is at all difficult it is best to code one or a small group of 
items of information on a batch of forms at one time, rather than to code the 
whole of each form in turn. Whatever the detailed procedure adopted, however, 
it is essential for the supervisors to carry out adequate checks to ensure that 
correct and consistent standards are maintained. 

In addition to routine checks it is often possible to impose checks of various 
kinds for gross errors and inconsistencies of coding. A special type of sorter, 
which picks out cards carrying a given code in a number of columns, can be 
used for this purpose. Furthermore it is sometimes advisable to pick out 
extreme values and list the relevant particulars, so that it can be ascertained 
whether these values look reasonable.* 

In the above remarks we have been primarily concerned with the human 
element. It must not be assumed, however, that punched card equipment 
operates without error. The mechanism may fail in various ways, and such 
failure may be momentary only. 

In order to devise adequate checks which at the same time are not unduly 
laborious, a full understanding of the mechanism is necessary. Thus, for 
example, if the control is operating on the columns on which the cards are 
sorted any mis-sort will be detected from the printed record, since the control 
will then break. Nevertheless sorting should always be given a preliminary 
check, either by passing a needle through the corresponding holes on each 
batch of cards, or, if the batches are small, by visual inspection, holding the 
batch up to a light. 

The correctness of the totals provides adequate checks for much of the 
work on the tabulator, but it must be remembered that such totals are not 
fully checked by a single run of the cards through the machine, even if sub- 
totals are accumulated on one counter and the grand total on another, since 
there may be a faulty reading of a card. Equally, if a series of progressive 
totals are taken on a counter, the fact that the grand total is correct does not 
mean that all the progressive totals are correct, since the printing mechanism 
may have printed one of the numbers incorrectly. 

The above remarks should not be taken to imply that any large number of 
errors are to be expected with punched-card equipment, but only that it must 
not be assumed that every sort and every printed figure is necessarily correct. 

Whatever the methods of calculation, the final results of every analysis 
should be carefully scrutinized for apparent inconsistencies and irregularities, 
and any anomalous values should be thoroughly investigated. 

Since the numerical material handled is itself in general subject to errors 
of various kinds, absolute numerical accuracy need not necessarily be attained 
at the early stages of the calculations. For this reason it is sometimes practicable 
to impose sample checks on such operations as punching and coding in large- 
scale work. If such checks are relied on, however, it is essential that steps 
are taken to prevent gross errors (Deming et al, 1942, B). 

Finally it should be emphasized that different types of work and different 
stages of the calculations demand very different standards of accuracy. A 
single misclassified card in a count, for example, will usually produce an 
entirely trivial error in the results. But the mispunching of a number 
representing a quantitative character, e.g. 610 for 010, may produce a serious 
error in the resultant mean or total of the class in which the unit falls. Errors 
in the final stages of the calculations are always likely to be more serious than 
those in the earlier stages. For this reason, in important work duplicate 
computations should be insisted on for the final stages, and these duplicates 
should themselves be used to check the typed or printed tables of the report. 


* See also Section 10.13. 


5.21 The use of sampling in the statistical analysis 


In certain cases it is possible to attain the necessary accuracy and at the 
same time to reduce the volume of numerical and machine work by analysing 
a sample of the available data. Such sampling may be applied to the data 
from a complete or sample census or survey. 

At first sight the use of sampling in this manner appears illogical, since 
it might be argued that if the collection of the information on the whole of the 
population or on a large sample was justified, its inclusion in the analysis is 
also justified. This, however, is not always the case, since a complete census 
or a large sample may have been taken in order to furnish information on 
individual units or on small groups of units, while the further analysis may 
be required to elicit information which does not require to be broken down in 
detail. Moreover it does sometimes happen that excessively large samples are 
taken which can well be reduced before analysis. 

As already mentioned, the analysis of a sample of the returns is also of use 
in providing preliminary results for a complete or sample census, even though 
the whole of the material will ultimately require analysis. 

Furthermore, when a large sample has been taken for administrative 
purposes, supplementary analyses of the investigational type can often best 
be undertaken on a sub-sample of the original sample. The reduction in the 
total volume of material to be handled is of particular value in such analyses, 
since they often require the application of relatively complicated statistical 
processes. Special points which emerge and on which a higher accuracy is 
desired can be re-tabulated subsequently by using the whole or a larger 
sub-sample of the material. 

The actual technique of obtaining a sample suitable for analysis is usually 
relatively simple. For many purposes a systematic sample of every gth return 
is all that is required. In some cases, however, the use of a variable sampling 
fraction is advisable. This is particularly the case in the analysis of census 
returns referring to economic institutions, factories, farms, etc., since these 
are usually of very variable size. 
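A minimal sketch of the two selection schemes in Python follows. The function name, the record layout and the intervals (1 in 10 and 1 in 2) are assumptions made for illustration; in practice the starting position within each interval would normally be chosen at random.

```python
def systematic_sample(returns, interval, start=0):
    """Take every `interval`-th return, beginning at position `start`."""
    if not 0 <= start < interval:
        raise ValueError("start must lie in the range 0 to interval - 1")
    return returns[start::interval]


# Hypothetical returns, each tagged with a size-group.
returns = [{"id": i, "group": "small" if i % 4 else "large"} for i in range(200)]

# Variable sampling fraction: 1 in 10 of the small units, 1 in 2 of the large.
intervals = {"small": 10, "large": 2}

sample = []
for group, interval in intervals.items():
    stratum = [r for r in returns if r["group"] == group]
    sample.extend(systematic_sample(stratum, interval))
```

With 150 small and 50 large units this gives 15 and 25 sampled returns respectively, so that the large units, though fewer, are the more heavily represented.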

An example of an analysis of this type is provided by the National Farm 
Survey of England and Wales (Ministry of Agriculture, 1944, G), which covered 
all holdings in England and Wales of over 5 acres. Sampling was not used in 
the survey because records were required for each individual farm, both for 
administrative purposes and for detailed studies of small areas. A map of the 
boundaries of each farm, for example, was one item of information which 
was collected. 

For the purpose of obtaining a general summary of the results by counties, 
types of farming, etc., the analysis of the whole of the material was unnecessary. 
The holdings were therefore divided into size-groups and a systematic sample 
stratified for counties and size-groups was taken, using a variable sampling 
fraction for size-groups. The sampling fractions, and numbers of holdings 
in the population and in the sample, are shown in Table 5.21. 

TABLE 5.21—ANALYSIS OF THE NATIONAL FARM SURVEY: 
CONSTITUTION OF SAMPLE 

Size-group    Average size    No. of       Sampling fraction    No. of holdings 
 (acres)        (acres)       holdings        (per cent.)          in sample 

  5- 25             12        101,450              5                 5,072 
 25-100             55        111,360             10                11,136 
100-300            165         65,210             25                16,302 
300-700            413         11,150             50                 5,575 
Over 700         1,035          1,430            100                 1,430 

Total                         290,600           (13.6)              39,515 
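The arithmetic of Table 5.21 can be checked with a short Python sketch. The variable names are illustrative, and the "raising factor" (the reciprocal of the sampling fraction, by which a sample total is multiplied to estimate the population total for its size-group) is introduced here as an assumption about how such a sample would ordinarily be used.

```python
# (size-group, holdings in population, sampling fraction in per cent.)
size_groups = [
    ("5-25", 101_450, 5),
    ("25-100", 111_360, 10),
    ("100-300", 65_210, 25),
    ("300-700", 11_150, 50),
    ("Over 700", 1_430, 100),
]

population = sum(n for _, n, _ in size_groups)

# Expected number of sampled holdings in each size-group.
expected = [n * pct / 100 for _, n, pct in size_groups]

# Overall sampling fraction implied by the variable fractions.
overall_pct = 100 * sum(expected) / population

# Each sampled holding stands for 100/pct holdings of its size-group.
raising_factors = {name: 100 / pct for name, _, pct in size_groups}
```

The expected numbers total 39,516, one more than the 39,515 of the table, because two of the products end in a half; the overall fraction is 13.6 per cent. in either case.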



Had a uniform sampling fraction been used in place of a variable sampling 
fraction, a sample over twice as large would have been required to give results 
of the same accuracy on such items as the percentage of land under different 
systems of tenure. By the use of a variable sampling fraction results of ample 
accuracy were obtained from an analysis covering only one-seventh of all the 
holdings. This not only considerably reduced the amount of coding and 
machine work, but also enabled work to proceed as soon as the information 
for the sample farms had been assembled and abstracted. In consequence 
it was possible to make the results of the analysis available a year or two sooner 
than would have been the case had the whole of the material had to be abstracted 
before analysis.* 


5.22 Adjustment of the results 

When the sampling procedure is defective in one respect or another, 
attempts are sometimes made to adjust the results in order to compensate for 
the defects. Thus it may happen that owing to defects in the selection of the 
sample or in the collection of the information, different classes of the population 
are found to be represented in incorrect proportions in the final sample. In 
such cases it is possible to adjust the results by weighting the different classes 
in such a way as to compensate for the errors in the proportions. 

This procedure must be clearly distinguished from the procedure of 
stratification after selection mentioned in Section 3.3. The validity of the 
latter procedure depends on the fact that the sample as a whole is random, 
and therefore the selection from within strata is also random. If the 
proportions in the different classes are different because of defects in the 
sampling procedure, however, it is most unlikely that the selection from within 
these classifications will be fully random. Any adjustment of the results, 
therefore, although it may somewhat improve matters, must not be expected 
to eliminate by any means the whole of the defects. 

* A further example is provided by the 1 per cent. sample of the 1951 Census of the 
United Kingdom, described in Section 10.16 (1952, G). 

Stratification after selection is a special case of the use of supplementary 
information of all kinds. Such adjustments, whether planned at the outset 
or decided on subsequently after examination of the data, are quite justified. 
The essential difference between these adjustments and adjustments 
of the same type made in order to compensate for defects in the sampling 
procedure is that in the former the selection is random, except for permissible 
restrictions, whereas in the latter it may be biased in various ways. 

In general, if the sampling procedure is defective it is best to report the 
results obtained without adjustment. At the same time data should be given 
indicating, so far as is possible, the deviations of the sample from the expected 
distributions. Thus if the proportions in the different classes of a classification 
are known for the population these may be presented alongside the parallel 
classification of the sample. Similarly the sample means of quantities for 
which the population means are known may be presented for comparison. 
Occasionally an adjustment of some of the more important values derived from 
the sample may be considered worth while, but in such cases the unadjusted 
results should also be presented. 

The above remarks apply primarily to samples for which the sampling 
procedure is markedly defective. In cases in which there are slight defects, 
such as a minor degree of non-response, the application of some small 
adjustment, if this appears necessary, is more justified. If such adjustments 
are made, however, the fact should be clearly stated and their magnitude 
should be indicated. 

The simplest way of dealing with non-response is to regard the non- 
respondents as similar to the remainder of the sample, i.e. to treat the sample 
as if it were a sample on a smaller number of units. With a stratified sample, 
the non-respondents in each stratum can be treated as the equivalent of 
respondents in that stratum.* Alternatively some other appropriate classification 
can be used, as indicated above. 

If follow-up methods have been used and there has been a good response 
to the follow-up, initial non-respondents who subsequently respond can be 
treated as a sub-sample of all initial non-respondents and weighted accordingly. 
It is clear that if there is any difference between respondents and non-respondents 
the final non-respondents may be expected to be more like the initial non- 
respondents than the general population. This procedure was first, so far as 
I know, suggested by Professor D. V. Glass, and was used by him in the 
analysis of the Family Census (Section 4.10). 

In this survey those who failed to provide the enumerator with the required 
information were sent a letter further explaining the purposes of the survey 
and requesting that the form be sent direct to the Royal Commission. Of the 
230,000 initial non-respondents (i.e. 17 per cent. of the whole sample), 50,000 
responded to this appeal. This 50,000 therefore constituted a sample (though 
a non-random one) of the 230,000, and the first 12,000 of the 50,000 replies 
were combined with the remainder of the sample with a weight of 230/12. 


* See Section 10.3 for a method of carrying out the adjustments in a large survey. 

This procedure was found to give overall birth-rates which corresponded 
very closely to those already known from other sources, whereas the original 
sample gave birth-rates which were substantially too high, owing to the fact that 
the majority of the initial non-respondents were women with few or no children. 
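The weighting described above can be sketched in Python as follows. The weight 230/12 is taken from the text; the number of initial respondents and the two mean family sizes are invented purely to show the direction of the adjustment, and are not figures from the Family Census.

```python
from fractions import Fraction


def weighted_mean(groups):
    """groups: iterable of (number of units, weight per unit, group mean)."""
    total_weight = sum(n * w for n, w, _ in groups)
    return sum(n * w * m for n, w, m in groups) / total_weight


# Each of the 12,000 follow-up replies stands for 230/12 initial non-respondents.
follow_up_weight = Fraction(230, 12)

# Hypothetical figures: (units, weight, mean number of children).
initial_respondents = (1_120_000, 1, 2.2)
follow_up_replies = (12_000, follow_up_weight, 1.4)

unadjusted = initial_respondents[2]
adjusted = weighted_mean([initial_respondents, follow_up_replies])
```

Because the follow-up replies carry the full weight of the 230,000 initial non-respondents, a lower mean among them pulls the adjusted figure below the unadjusted one, which is the effect reported for the birth-rates.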


5.23 Critical analysis of survey data 

At the outset a clear distinction must be made between the types of deduction 
that can be made with certainty from survey data and the types that are 
speculative. If in a nutrition survey, for example, we find that children of 
large families are worse fed than children of small families we can draw the 
definite conclusion that size of family is associated with malnutrition of the 
children, and we can give quantitative estimates of the degree of malnutrition 
actually existing amongst children of families of different sizes. We cannot, 
however, assert with certainty that size of family is the cause of this malnutrition, 
though the fact that in large families the income per head is automatically 
less if there is a fixed total income would lead us to expect an underlying causal 
relationship. 

Even in situations where a definite causal relationship is known to exist, 
deductions as to the magnitudes of the effects of given factors can never be 
made with certainty from survey data. We may, for instance, find that fields 
receiving fertilizers give higher yields per acre than fields without fertilizers. 
Yet we cannot attribute the observed differences solely to differences in fertilizers. 
The farmers using the fertilizers may be farming better land, they may be 
growing higher-yielding varieties, and they may be carrying out their farming 
operations with greater skill. 

Clearly definable extraneous factors which may influence the estimates of 
the effects of other factors can be determined in the course of the survey. 
Under certain circumstances the disturbance due to them can be eliminated 
by methods of analysis which will be outlined in this and the following section. 
But there will always be other undetermined and possibly unascertainable 
factors which cannot be taken into account. 

In order to determine with certainty the magnitude in the causal sense of 
the effect of any given factor, experiments must be undertaken. Surveys 
cannot be regarded as satisfactory substitutes for experiments. Nevertheless 
they are of value in situations in which experiments are difficult or impossible, 
though in such cases all conclusions must be tentative. They are also of use 
as a preliminary to experimental work, since they frequently indicate the 
factors that are likely to be most worth investigation. 

If, however, survey data are to be effectively used for either of these two 
purposes it is important to have means of eliminating the effects of extraneous 
factors in so far as this is possible. 

A simple example will illustrate the problem involved. Table 5.23.a 
gives the numbers of fields, totals and means of yields per acre of a sample of 
901 potato fields classified according to (a) the five regions into which the 
country was divided, and (b) the five varieties included in the survey. These 
results were obtained in the course of an investigation into the blackening of 
potatoes on cooking, the data on yields being collected from the farmers when 
the samples were taken, together with a considerable amount of information 
on fertilizers, cultural practices, etc. Approximately 180 fields were selected 
in each region. The selection within regions was not strictly random, but can 
be regarded as substantially so for the purpose of the present discussion. The 
sample was confined to the five named varieties, but was not stratified by 
varieties. 

From the results of Table 5.23.a it is apparent that the mean yield is highest 
for the Scottish region, and is also higher for the Northern region than for the 
remaining regions. There are, however, even larger varietal differences in 
yield. Consequently if the varieties were grown in different proportions in 
the different regions the regional differences are likely to be influenced by 
varietal differences. 

To examine this point it is necessary to construct the two-way classification, 
regions × varieties. This is shown in Table 5.23.b. The values of 
Table 5.23.a appear as marginal totals in this table. 


TABLE 5.23.a—POTATO SURVEY: NUMBERS OF FIELDS AND TOTALS AND MEANS 
OF THE YIELDS PER ACRE (TONS) 

(a) Classified by regions 

                    No.      Total      Mean 
Scotland            174      1,482      8.52 
North               177      1,425      8.05 
E. Midlands         189      1,415      7.49 
South               182      1,324      7.27 
West                179      1,368      7.64 
ALL                 901      7,014      7.78 

(b) Classified by varieties 

                    No.      Total      Mean 
Majestic            393      3,292      8.38 
King Edward         250      1,563      6.25 
Great Scot           56        461      8.23 
Arran Banner         84        766      9.12 
Kerr's Pink         118        932      7.90 
ALL                 901      7,014      7.78 

TABLE 5.23.b—POTATO SURVEY: TWO-WAY CLASSIFICATION OF THE DATA BY 
REGIONS AND VARIETIES 

Numbers of fields 

                  Scot.    North   E. Mid.   South    West    Total 
Majestic            37       75      104      101       76      393 
King Edward         42       14       85       66       43      250 
Great Scot          18       14       —         6       18       56 
Arran Banner         8       38       —         9       29       84 
Kerr's Pink         69       36       —        —        13      118 
TOTAL              174      177      189      182      179      901 

Totals of yields per acre (tons) 

                  Scot.    North   E. Mid.   South    West    Total 
Majestic           350      614      876      823      629    3,292 
King Edward        321       80      539      387      236    1,563 
Great Scot         166      106       —        49      140      461 
Arran Banner        73      351       —        65      277      766 
Kerr's Pink        572      274       —        —        86      932 
TOTAL            1,482    1,425    1,415    1,324    1,368    7,014 

Means of yields per acre (tons) 

                  Scot.    North   E. Mid.   South    West     All 
Majestic          9.46     8.19     8.42     8.15     8.27     8.38 
King Edward       7.65     5.71     6.34     5.87     5.49     6.25 
Great Scot        9.22     7.57      —       8.17     7.78     8.23 
Arran Banner      9.12     9.24      —       7.22     9.54     9.12 
Kerr's Pink       8.29     7.61      —        —       6.61     7.90 
ALL               8.52     8.05     7.49     7.27     7.64     7.78 
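The relation between the three parts of Table 5.23.b can be set out in a short Python sketch: each mean is the total of yields divided by the number of fields, and the margins are obtained by adding numbers and totals, not by averaging means. The variable and function names are illustrative. The number of King Edward fields in the South is taken as 66, the figure consistent with the marginal totals of 250 and 182.

```python
regions = ["Scot.", "North", "E. Mid.", "South", "West"]

# Numbers of fields; 0 marks an unoccupied cell.
counts = {
    "Majestic":     [37, 75, 104, 101, 76],
    "King Edward":  [42, 14, 85, 66, 43],
    "Great Scot":   [18, 14, 0, 6, 18],
    "Arran Banner": [8, 38, 0, 9, 29],
    "Kerr's Pink":  [69, 36, 0, 0, 13],
}

# Totals of yields per acre (tons).
totals = {
    "Majestic":     [350, 614, 876, 823, 629],
    "King Edward":  [321, 80, 539, 387, 236],
    "Great Scot":   [166, 106, 0, 49, 140],
    "Arran Banner": [73, 351, 0, 65, 277],
    "Kerr's Pink":  [572, 274, 0, 0, 86],
}


def cell_means(counts, totals):
    """Sub-class means; None where the cell is unoccupied."""
    return {
        v: [t / n if n else None for n, t in zip(counts[v], totals[v])]
        for v in counts
    }


means = cell_means(counts, totals)

# Regional margins: add the numbers and the totals, then divide.
region_counts = [sum(col) for col in zip(*counts.values())]
region_totals = [sum(col) for col in zip(*totals.values())]
region_means = [t / n for n, t in zip(region_counts, region_totals)]
```

The regional means so obtained (8.52 for Scotland, for example) are those of Table 5.23.a, and are therefore weighted by whatever proportions of the varieties happen to occur in each region.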

From the first part of this table (numbers of fields) it is apparent that the 
distribution of varieties is by no means the same for all regions. In particular 
very little King Edward, which yields about 2 tons per acre less than the other 
varieties, was grown in the Northern region. The relatively high yield of the 
Northern region is therefore accounted for, in part at least, by varietal 
differences. 

To make an estimate of what the differences between regions would be if 
the proportions of the different varieties were the same in all regions we can 
compare the regional means of the individual varieties. These are given in 
the last part of Table 5.23.b. Inspection shows that Scotland gives higher 
yields than every other region for all varieties except Arran Banner, of which 
there are only 8 fields in Scotland. The Northern region, on the other hand, 
does not show any consistent differences from the other English regions. 

Inspection of this kind may in certain cases be all that is necessary. 
Frequently, however, quantitative estimates of the differences attributable to 
one classification when freed from the effects of a second classification are 
required. Such estimates may be obtained in various ways, depending on the 
nature of the table and which differences are of interest. 


(1) Unweighted means of sub-class means 


If all the sub-class means are of adequate accuracy the marginal unweighted 
means of these means can be taken. These unweighted means will give 
estimates of the differences attributable to either classification when the units 
of each class are equally divided between the classes of the other classification. 

This method cannot be applied to the whole of Table 5.23.b because of 
the unoccupied cells, but it can be applied to the parts of the table represented 
by all varieties in the Scottish, Northern and Western regions, or all regions 
for varieties Majestic and King Edward. The results are shown in Table 5.23.c. 
For comparative purposes each set of means has been adjusted by adding a 
constant amount so that the mean of the set is equal to the general mean. 
The similarity of the Northern region with the other English regions is 
confirmed. The yield of Kerr’s Pink has also been reduced relative to the other 
varieties. This is a consequence of the high proportion of Kerr’s Pink in 
Scotland. 
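As a check on Method 1, the regional figures of column (a) can be reproduced in a few lines of Python from the sub-class means for the three regions in which all five varieties occur. The means are entered to two decimals, those for Kerr's Pink in Scotland and the North being 572/69 and 274/36 from the totals and numbers of fields; the names used are illustrative.

```python
from statistics import mean

# Sub-class means in the order Majestic, King Edward, Great Scot,
# Arran Banner, Kerr's Pink.
sub_class_means = {
    "Scotland": [9.46, 7.65, 9.22, 9.12, 8.29],
    "North":    [8.19, 5.71, 7.57, 9.24, 7.61],
    "West":     [8.27, 5.49, 7.78, 9.54, 6.61],
}
GENERAL_MEAN = 7.78

# Unweighted mean of the five varietal means in each region.
unadjusted = {r: round(mean(v), 2) for r, v in sub_class_means.items()}

# Add a constant so that the mean of the set equals the general mean.
shift = GENERAL_MEAN - mean(list(unadjusted.values()))
adjusted = {r: round(m + shift, 2) for r, m in unadjusted.items()}
```

Every variety receives the same weight of one-fifth in each region, whatever the number of fields on which its mean is based.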


(2) Standardization of proportions by weighted means of sub-class means 


This is similar to Method 1. The weighted means of columns (or rows) 
of the table of sub-class means are taken, with weights roughly in proportion 
to the numbers in each row (or column) class in the whole sample. 

An example of this method is given in Example 6.8. It is not applicable 
in full to Table 5.23.b on account of unoccupied cells. 

It should be noted that somewhat different quantities are estimated by the 
two methods. Failure to recognize this fact sometimes causes a certain amount 
of confusion. If applied to a varieties × regions table, for example, Method 1 
estimates the differences between regions that would occur in the hypothetical 
situation in which equal numbers of fields of all varieties were grown in each 
region. Method 2 gives estimates appropriate to the hypothetical situation 
in which the different varieties are grown in the same proportions in all regions, 
these proportions being equal to the average proportions for the whole country. 
Only if the differences between the different varieties are the same for all 
regions will the estimated differences be the same. 

Method 2 has two advantages over Method 1. It is in general more 
accurate, since greater weight is on the average given to the cells containing 
the greater numbers of units. It also gives estimates which refer to a 
hypothetical situation more in conformity with that actually existing. 
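The two quantities can be written down explicitly. The notation is introduced here for illustration and is not that of the text: $\bar{y}_{ij}$ is the sub-class mean for variety $i$ in region $j$, $k$ is the number of varieties, $n_{i\cdot}$ is the number of fields of variety $i$ in the whole sample and $n$ is the total number of fields. The regional values given by the two methods are then

$$\bar{y}^{(1)}_{j} = \frac{1}{k}\sum_{i=1}^{k} \bar{y}_{ij}, \qquad \bar{y}^{(2)}_{j} = \sum_{i=1}^{k} \frac{n_{i\cdot}}{n}\,\bar{y}_{ij}.$$

If the varietal differences are the same in all regions, so that $\bar{y}_{ij} = a_i + b_j$, both expressions give $b_j - b_{j'}$ for the difference between regions $j$ and $j'$, since the weights in each case add to unity; otherwise the two differences will in general disagree.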


TABLE 5.23.c—POTATO SURVEY: UNWEIGHTED MEANS OF SUB-CLASS MEANS 
(a) OMITTING E. MIDLANDS AND SOUTHERN REGIONS, (b) OMITTING GREAT 
SCOT, ARRAN BANNER AND KERR'S PINK 

                     Unadjusted          Means adjusted 
                       means             to sample mean 
                    (a)      (b)         (a)      (b) 
Scotland           8.75     8.56        8.55     8.98 
North              7.66     6.95        7.46     7.37 
E. Midlands         —       7.38         —       7.80 
South               —       7.01         —       7.43 
West               7.54     6.88        7.34     7.30 
Mean               7.98     7.36        7.78     7.78 

Majestic           8.64     8.50        8.44 
King Edward        6.28     6.21        6.08 
Great Scot         8.19      —          7.99      — 
Arran Banner       9.30      —          9.10      — 
Kerr's Pink        7.50      —          7.30      — 
Mean               7.98     7.36        7.78     7.78 

The marginal means of the sub-class means, 
whether weighted or unweighted, do not contain the whole of the information. 
If variety P yields more than variety Q in one region and less in another, for 
example, this fact can only be established from the sub-class means. Under 
such circumstances any comparison of the regions based on equalization of 
the proportions of the varieties represents an over-simplification of the real 
situation. 

Neither Method 1 nor Method 2 can be applied to the whole of a table 
in which there are blank cells. Even if there are no blank cells neither method 
will be very satisfactory when there are certain cells which contain so few 
units that the corresponding cell means are very inaccurate. Thus in the 
present example the relatively large difference between Scotland and the 
Northern region in column (b) of Table 5.23.c, in contrast to column (a), 
is in part due to the fact that Arran Banner has given a larger yield in the 
Northern region than in Scotland. This may well be due to sampling errors, 
since there are only 8 fields of Arran Banner in Scotland. 


There are two further relatively simple methods which are of value in such 
circumstances. 


(3) Weighted means of differences of sub-class means 


If only two classes (cross-classified by a second classification into a number 
of sub-classes) are to be compared, a weighted mean of the differences of each 
pair of sub-classes can be taken. Maximum accuracy will be attained when 
the weights are inversely proportional to the squares of the standard errors 
of these differences.* 

With independent samples in each sub-class the square of the standard 
error of a difference is equal to the sum of the squares of the standard errors 
of the two means (Section 7.5). Under certain circumstances, which will 
be apparent from a study of Chapter 7, and in particular when the selection 
from within sub-classes is effectively random and the standard deviation per 
unit is constant, the standard errors are inversely proportional to the square 
roots of the numbers of units in the sub-classes (Section 7.1). In this case 
the reciprocals of the weights must be taken proportional to the sums of the 
reciprocals of the pairs of sub-class numbers, i.e. if $n_1$ and $n_2$ are a pair of 
sub-class numbers the weight can be taken equal to $w$, where 

$$\frac{1}{w} = \frac{1}{n_1} + \frac{1}{n_2}$$


The calculations are shown in Table 5.23.d. The weight for Majestic, for 
example, is given by 1/w = 1/37 + 1/75. Weights may be taken to the nearest 
whole number. They can be rapidly calculated from a table of reciprocals 
or on a slide rule. The weighted mean, +1.08, is obtained by dividing the 
total of wz by the total of w. 


TABLE 5.23.d—POTATO SURVEY: ESTIMATE OF DIFFERENCE OF SCOTTISH AND 
NORTHERN REGIONS FROM WEIGHTED MEAN OF VARIETAL DIFFERENCES 

                    Difference     Weight 
                        z            w           wz 
Majestic             +1.27          25        +31.75 
King Edward          +1.94          10        +19.40 
Great Scot           +1.65           8        +13.20 
Arran Banner         −0.12           7         −0.84 
Kerr's Pink          +0.68          24        +16.32 
Total                               74        +79.83 
Weighted mean        +1.08 


* See Sections 9.6 and 9.7 for further examples of this procedure. 
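The calculation of Method 3 for Scotland and the Northern region can be sketched in Python as follows. The numbers of fields and sub-class means are those of Table 5.23.b; the function and variable names are illustrative. Both the whole-number weights of the worked example and the unrounded weights are shown, the latter being an addition made here to indicate how little the rounding matters.

```python
# variety: (fields in Scotland, fields in North, mean Scotland, mean North)
cells = {
    "Majestic":     (37, 75, 9.46, 8.19),
    "King Edward":  (42, 14, 7.65, 5.71),
    "Great Scot":   (18, 14, 9.22, 7.57),
    "Arran Banner": (8, 38, 9.12, 9.24),
    "Kerr's Pink":  (69, 36, 8.29, 7.61),
}


def pair_weight(n1, n2):
    """Weight w defined by 1/w = 1/n1 + 1/n2."""
    return 1.0 / (1.0 / n1 + 1.0 / n2)


def weighted_mean(values, weights):
    """Total of w*z divided by total of w."""
    return sum(weights[k] * values[k] for k in values) / sum(weights.values())


differences = {v: m1 - m2 for v, (_, _, m1, m2) in cells.items()}
exact_weights = {v: pair_weight(n1, n2) for v, (n1, n2, _, _) in cells.items()}

# Whole-number weights as used in the worked example.
table_weights = {
    "Majestic": 25, "King Edward": 10, "Great Scot": 8,
    "Arran Banner": 7, "Kerr's Pink": 24,
}

estimate_table = weighted_mean(differences, table_weights)
estimate_exact = weighted_mean(differences, exact_weights)
```

With the whole-number weights the estimate is 79.83/74, or 1.08 to two decimals; with the unrounded weights it is about 1.09.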


(4) Pooling of classes 

In cases in which inspection or use of the previous methods indicates that 
the differences between certain of the classes are small, such classes can be 
pooled. This often eliminates blanks and very small numbers in the sub-class 
table. 

In the present example inspection indicates that there is little difference 
between the four English regions. This is confirmed by the means in 
Table 5.23.c. These regions may therefore be pooled. This pooling will 
permit a better estimate of the differences between the last three varieties than 
that given by Table 5.23.c. After pooling, Scotland can be included with a 
weight of 1 against a weight of 4 for the pooled English regions, following 
Method 2. 

The calculations are shown in Table 5.23.e. It will be noted that the 
pooling can be effected by adding the numbers of fields and totals of yields 
for the four English regions from Table 5.23.b. 


TABLE 5.23.e—POTATO SURVEY: EFFECT OF POOLING ENGLISH REGIONS 

                  English regions (pooled)     Scotland     Weighted 
                   No.     Total     Mean        Mean         mean 
Majestic           356     2,942     8.26        9.46         8.50 
King Edward        208     1,242     5.97        7.65         6.31 
Great Scot          38       295     7.76        9.22         8.05 
Arran Banner        76       693     9.12        9.12         9.12 
Kerr's Pink         49       360     7.35        8.29         7.54 
ALL                727     5,532     7.61        8.52         7.79 
Weight                                  4           1 
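The pooling can be reproduced from the numbers of fields and totals of yields of Table 5.23.b, as in the following Python sketch. The names are illustrative, and the weights of 4 and 1 are those shown in the last line of Table 5.23.e.

```python
# Columns: North, E. Midlands, South, West (0 marks an unoccupied cell).
english_counts = {
    "Majestic":     [75, 104, 101, 76],
    "King Edward":  [14, 85, 66, 43],
    "Great Scot":   [14, 0, 6, 18],
    "Arran Banner": [38, 0, 9, 29],
    "Kerr's Pink":  [36, 0, 0, 13],
}
english_totals = {
    "Majestic":     [614, 876, 823, 629],
    "King Edward":  [80, 539, 387, 236],
    "Great Scot":   [106, 0, 49, 140],
    "Arran Banner": [351, 0, 65, 277],
    "Kerr's Pink":  [274, 0, 0, 86],
}
scotland_means = {
    "Majestic": 9.46, "King Edward": 7.65, "Great Scot": 9.22,
    "Arran Banner": 9.12, "Kerr's Pink": 8.29,
}
W_ENGLISH, W_SCOTLAND = 4, 1

# Pool by adding numbers and totals, then divide.
pooled_means = {
    v: sum(english_totals[v]) / sum(english_counts[v]) for v in english_counts
}

weighted = {
    v: round(
        (W_ENGLISH * pooled_means[v] + W_SCOTLAND * scotland_means[v])
        / (W_ENGLISH + W_SCOTLAND),
        2,
    )
    for v in pooled_means
}
```

Pooling removes the unoccupied cells, so that every variety now has a mean for both the English and the Scottish class.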


5.24 Method of fitting constants 

If there are more than two classes between which differences are required, 
Method 3 can be used to compare each pair separately. It will not, however, 
give a consistent set of estimates, i.e. the sum of the estimated differences 
between A and B and between B and C will not exactly equal that between 
A and C. Moreover the method, although giving estimates of maximum accuracy 
if there are only two classes in the table, is not fully accurate (in the statistical 
sense) when there are more than two classes. In addition to direct 
comparisons between A and B, indirect comparisons of A with C and C with B, 
etc., can be made. When the sub-class frequencies are not proportionate these 
will contribute some additional information. To take an extreme case, if variety 
P occurs in regions A and C only, and variety Q in regions B and C only, 
comparisons between regions A and B with differences between varieties 
eliminated can only be made by indirect comparisons involving region C. 

Estimates of maximum accuracy, which, as might be expected, are also 
consistent, can be obtained by fitting constants by the method of least squares 
(Yates, 1934, A ; Snedecor, 1934, A ; Snedecor and Cox, 1935, A). Snedecor 
has drawn attention to the fact that if the numbers of units in the 
different sub-classes are nearly proportionate, Method 2 can be used without 
appreciable loss of information in place of the more laborious method of 
fitting constants. 
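A minimal sketch of such a fit in Python is given below for the potato data of Table 5.23.b. It assumes an additive model (general mean plus a varietal constant plus a regional constant) and solves the least-squares equations by adjusting the two sets of constants alternately until they settle. This alternating scheme is chosen here for brevity and is not claimed to be the arithmetic of any of the papers cited; the function name and the number of sweeps are arbitrary.

```python
# Numbers of fields: rows are varieties (Majestic, King Edward, Great Scot,
# Arran Banner, Kerr's Pink); columns are Scot., North, E. Mid., South, West.
counts = [
    [37, 75, 104, 101, 76],
    [42, 14, 85, 66, 43],
    [18, 14, 0, 6, 18],
    [8, 38, 0, 9, 29],
    [69, 36, 0, 0, 13],
]
variety_totals = [3292, 1563, 461, 766, 932]    # totals of yields
region_totals = [1482, 1425, 1415, 1324, 1368]


def fit_constants(counts, row_totals, col_totals, sweeps=50):
    """Fit additive row and column constants by alternate adjustment."""
    n_rows = [sum(r) for r in counts]
    n_cols = [sum(c) for c in zip(*counts)]
    general_mean = sum(row_totals) / sum(n_rows)
    # Observed deviations of the marginal means from the general mean.
    row_dev = [t / n - general_mean for t, n in zip(row_totals, n_rows)]
    col_dev = [t / n - general_mean for t, n in zip(col_totals, n_cols)]
    rows = list(row_dev)
    cols = [0.0] * len(n_cols)
    for _ in range(sweeps):
        # Column constants, allowing for the current row constants.
        cols = [
            col_dev[j]
            - sum(counts[i][j] * rows[i] for i in range(len(rows))) / n_cols[j]
            for j in range(len(cols))
        ]
        # Row constants, allowing for the current column constants.
        rows = [
            row_dev[i]
            - sum(counts[i][j] * cols[j] for j in range(len(cols))) / n_rows[i]
            for i in range(len(rows))
        ]
    return general_mean, rows, cols


general_mean, variety_const, region_const = fit_constants(
    counts, variety_totals, region_totals
)
scotland_minus_north = region_const[0] - region_const[1]
majestic_minus_king_edward = variety_const[0] - variety_const[1]
```

Only differences between constants of the same set are determinate, since any amount may be added to all the varietal constants and subtracted from all the regional ones. The sketch gives a difference of roughly 1.2 tons per acre between Scotland and the Northern region with varietal differences eliminated.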

Stevens (1948, A) has given a simple arithmetical method of obtaining the 
values of the estimates derived by fitting constants. The procedure, which is 
one of successive approximation, is illustrated for the data of our example 
in Table 5.24. 


TABLE 5.24—POTATO SURVEY: ESTIMATION OF VARIETAL AND REGIONAL 
DIFFERENCES BY FITTING CONSTANTS 

                                                      Starting values          Final 
                      Numbers of fields               and corrections          values 
          Total    Sc.     N.    E.M.    S.     W.     (1)     (2)     (3)     (4)     (5) 

All        901     174    177    189    182    179 
Maj.       393      37     75    104    101     76    +.60    +.10    +.02    +.73    8.51 
K.E.       250      42     14     85     66     43   −1.53     .00    +.02   −1.51    6.27 
G.S.        56      18     14     —       6     18    +.45    −.08    −.03    +.34    8.12 
A.B.        84       8     38     —       9     29   +1.34    +.16     .00   +1.50    9.28 
K.P.       118      69     36     —      —      13    +.12    −.39    −.08    −.35    7.43 

Starting values   (1)    +.74   +.27   −.29   −.51   −.14 
and corrections   (2)    +.09   −.48   +.36   +.14   −.16 
                  (3)    +.83   −.21   +.07   −.37   −.30 
                  (4)    +.13   +.01   −.06   −.06   −.03 
                  (5)    +.03   +.01   −.02   −.02    .00 
Final values      (6)    +.99   −.19   −.01   −.45   −.33 
                  (7)    8.77   7.59   7.77   7.33   7.45 

The number of fields in each sub-class, reproduced from Table 5.23.b, 
is shown in the body of the table, with marginal totals above and to the left. 
Column 1 and row 1 give the deviations of the varietal and regional mean
yields from the general mean 7.78. These are obtained from Table 5.23.a
or 5.23.b. Thus, for Majestic, 8.38 − 7.78 = + 0.60. These data are all
that are necessary for the estimation process.

The approximation should be started with the set of deviations which 
show the biggest differences, in this case the varietal deviations. We first 
calculate what the regional deviations would be with these varietal deviations
if there were no regional differences. These regional deviations are shown 
with signs reversed in row 2. To obtain them the numbers of fields in each 
column are multiplied in turn by the deviations in column 1, summed and 
divided by the total number of fields in each column. Thus for Scotland 
the deviation is 


(+ .60 × 37 − 1.53 × 42 + .45 × 18 + 1.34 × 8 + .12 × 69)/174
= − 14.96/174 = − .09


and similarly for the other columns. It is best to record the sums of products, 
− 14.96, etc., before division, as this provides a check against minor errors,
the total of these sums of products being equal to the sum of the products of
column 1 and the total column of numbers of fields. This sum, + 5.22,
differs from zero only because of rounding-off errors.

Since the observed deviation for Scotland is + .74, and the expected
deviation if there were no regional differences is − .09, a first estimate of the
true regional deviation for Scotland with varietal differences eliminated is
+ .74 − (− .09) = + .83. These estimates are shown in row 3, which is
obtained by adding row 2 to row 1.
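
The arithmetic of this first step, including the check on the sums of products, can be sketched in Python. This is only an illustration of the hand computation: the numbers of fields and the deviations are transcribed from Table 5.24, and the variable names (`counts`, `var_dev`, `reg_dev`, and so on) are introduced here for convenience and do not come from the text.

```python
# Numbers of fields in each sub-class, transcribed from Table 5.24.
# Rows: Maj., K.E., G.S., A.B., K.P.; columns: Sc., N., E.M., S., W.
counts = [
    [37, 75, 104, 101, 76],
    [42, 14, 85, 66, 43],
    [18, 14, 0, 6, 18],
    [8, 38, 0, 9, 29],
    [69, 36, 0, 0, 13],
]
var_dev = [0.60, -1.53, 0.45, 1.34, 0.12]    # column 1: varietal deviations
reg_dev = [0.74, 0.27, -0.29, -0.51, -0.14]  # row 1: regional deviations

n_reg = len(reg_dev)
row_tot = [sum(row) for row in counts]                           # 393, 250, ...
col_tot = [sum(row[r] for row in counts) for r in range(n_reg)]  # 174, 177, ...

# Sums of products of column 1 with each column of numbers of fields,
# recorded before division (for Scotland this is -14.96).
sums_of_products = [
    sum(row[r] * d for row, d in zip(counts, var_dev)) for r in range(n_reg)
]

# Check: their total equals the sum of products of column 1 with the
# marginal totals of the varieties.
check = sum(t * d for t, d in zip(row_tot, var_dev))

# Expected regional deviations, entered with signs reversed as row 2,
# and the first estimates of the regional deviations (row 3).
row2 = [-round(s / t, 2) for s, t in zip(sums_of_products, col_tot)]
row3 = [round(a + b, 2) for a, b in zip(reg_dev, row2)]

print(sums_of_products, check, row2, row3)
```

Working to two decimal places, as in the hand computation, the first entry of `row3` is the Scottish estimate discussed above.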

The varietal deviations in column 1, however, are themselves affected 
by regional differences. These may now be corrected for by the same process, 
using the estimates of the regional deviations just obtained. To do this the 
values of row 3 are multiplied in turn by the numbers of fields in each row, 
summed and divided by the total number of fields in the row. Thus for
Majestic we obtain, after reversal of sign, the correction + .10. These
corrections are shown in column 2. The checks operate as before.

The corrections in column 2 could now be added to the values of column 1 
to give second approximations to the varietal differences. It is simpler, however, 
to use the corrections themselves to calculate corrections to the estimated 
regional differences of row 3. These are shown in row 4. The same procedure 
of calculation is followed, and the signs are reversed as before.

The values of row 4 are then used to give a second set of varietal corrections,
which are shown in column 3, and these in turn are used to give a third set
of regional corrections, which are shown in row 5.

The process may be stopped at this point, since the corrections are now so
small that the next set will be negligible. Columns 1, 2 and 3 and rows 3, 4
and 5 are therefore summed to give the final estimates of the deviations, shown
in column 4 and row 6. These deviations may then be added to the general
mean 7.78 to give final estimates of the varietal means freed from regional
differences (column 5), and of regional means freed from varietal differences
(row 7). Final checks are obtained by forming the sums of the products of these
means with the total numbers of fields and dividing by the total number of fields.
In each case the general mean 7.78 should be obtained.
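
The whole procedure of successive approximation can be sketched as a short Python function. The sketch is offered under stated assumptions rather than as a definitive implementation: the function name `fit_constants` and its parameters are hypothetical, the data are transcribed from Table 5.24, and each correction is rounded to two decimal places to imitate the hand computation. With three rounds it agrees with the printed final values to within one unit in the second decimal place.

```python
# Numbers of fields (rows: Maj., K.E., G.S., A.B., K.P.; columns: Sc., N.,
# E.M., S., W.) and starting deviations, transcribed from Table 5.24.
COUNTS = [
    [37, 75, 104, 101, 76],
    [42, 14, 85, 66, 43],
    [18, 14, 0, 6, 18],
    [8, 38, 0, 9, 29],
    [69, 36, 0, 0, 13],
]
VAR_DEV = [0.60, -1.53, 0.45, 1.34, 0.12]    # column 1
REG_DEV = [0.74, 0.27, -0.29, -0.51, -0.14]  # row 1
GENERAL_MEAN = 7.78


def fit_constants(counts, var_dev, reg_dev, rounds=3, ndigits=2):
    """Return (varietal, regional) deviations after `rounds` approximations."""
    n_var, n_reg = len(var_dev), len(reg_dev)
    row_tot = [sum(row) for row in counts]
    col_tot = [sum(row[r] for row in counts) for r in range(n_reg)]

    def regional_correction(var_terms):
        # Weighted mean of the varietal terms in each column, sign reversed.
        return [
            round(-sum(counts[v][r] * var_terms[v] for v in range(n_var))
                  / col_tot[r], ndigits)
            for r in range(n_reg)
        ]

    def varietal_correction(reg_terms):
        # Weighted mean of the regional terms in each row, sign reversed.
        return [
            round(-sum(counts[v][r] * reg_terms[r] for r in range(n_reg))
                  / row_tot[v], ndigits)
            for v in range(n_var)
        ]

    var_total = list(var_dev)                          # column 1
    row2 = regional_correction(var_dev)                # row 2
    reg_term = [a + b for a, b in zip(reg_dev, row2)]  # row 3
    reg_total = list(reg_term)
    for _ in range(rounds - 1):
        var_term = varietal_correction(reg_term)       # columns 2, 3, ...
        var_total = [a + b for a, b in zip(var_total, var_term)]
        reg_term = regional_correction(var_term)       # rows 4, 5, ...
        reg_total = [a + b for a, b in zip(reg_total, reg_term)]
    return var_total, reg_total


var_final, reg_final = fit_constants(COUNTS, VAR_DEV, REG_DEV)
varietal_means = [round(GENERAL_MEAN + d, 2) for d in var_final]
regional_means = [round(GENERAL_MEAN + d, 2) for d in reg_final]
print(varietal_means)
print(regional_means)
```

Rounding at every step is a choice made here to follow the table; setting `ndigits` to a larger value and increasing `rounds` lets the corrections die away without rounding error.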
It will be seen that the final values do not differ greatly from the corresponding
values of Tables 5.23.c, 5.23.d and 5.23.e, but they do differ substantially
from the values of Table 5.23.a. The differences between Scotland and the
Northern region, for example, are as follows :—


Regional means over all varieties (Table 5.23.a) . . . . . . 0.47
Unweighted means of sub-class means (Table 5.23.c) . . . . .
Weighted differences (Table 5.23.d) . . . . . . . . . . . . 1.08
Fitting constants (Table 5.24) . . . . . . . . . . . . . . . 1.18


Equally the varietal means given in Table 5.23.e are very close to those of 
Table 5.24. 

This demonstrates that the more direct methods, used with judgment, 
are capable of giving satisfactory estimates. As is shown in Example 7.5, 
where the errors of these estimates are discussed, the third of the above regional 
differences, viz. 1.61, is decidedly less accurate than the others, since the
information provided by the last three varieties is not taken into account. 
On the other hand, the agreement of the above estimates must not be taken 
as providing any indication of their real accuracy. Since all the estimates are 
based on the same data, any disagreement is primarily a reflection of the effects 
of the various approximations on the efficiency of the estimation process.

There are some further general points in connection with the method of 
fitting constants which should be noted. 

The method will provide efficient estimates of the differences due to one 
classification, freed from the effects of the other classification, if the true 
differences are the same for all classes of the second classification. In the 
terminology of the design of experiments, this is equivalent to the non-existence 
of interactions. If the true differences vary markedly, the method is 
inappropriate. Instead the individual differences should be considered or 
Method 1 or Method 2 should be used. (See sub-section (2) above.) 

As mentioned above, if one of the classifications consists of two classes 
only, Method 3 is fully efficient for estimating the difference between these 
two classes. If estimates of the differences between the classes of the second 
classification are required, these may be derived by adjusting the class means 
in the manner followed in Table 5.24, assigning deviations of plus and minus 
half the difference to the two classes of the first classification. The method 
is in this case exact, and therefore no further approximations are required.

The method can be extended to multiple classifications having three or 
more sets of classes. In this case the data required are the general means for 
each main classification together with the two-way tables of numbers of units 
corresponding to each pair of classifications. The numbers of units in the 
individual cells of the three- or more-way table are not required. If there
are only two classifications the general means for each main classification,
together with the numbers of units in the individual sub-classes, are required.
The reporting of sub-class numbers of the relevant pairs of classifications 
should therefore be considered even in cases in which the reporting of the 
separate sub-class means or the calculation of adjusted estimates is not considered
worth while. If these numbers are available the various inter-relations may 
then be studied subsequently without further reference to the original material. 


5.25 Preparation of reports 


When the analysis of a census or survey has been completed it is usually 
necessary to embody the results in a report. In addition to the presentation 
of the numerical results in the form of tables and graphs, some discussion and 
interpretation is also required. 

The lines to be followed in the preparation of such reports vary greatly 
according to the nature of the material and the purposes of the report, but 
are in general similar for sample censuses and surveys and for complete censuses
and surveys. We therefore do not propose to discuss them here.

There are, however, certain matters which should be reported on in a sample 
census or survey which do not arise (or are of less importance) in a complete 
census. These matters have been covered by the memorandum (already
referred to in Section 1.6) prepared by the United Nations Sub-Commission 
on Statistical Sampling, entitled Recommendations concerning the Preparation 
of Reports on Sampling Surveys. The recommendations* are as follows :— 


(1) General description of the survey

The general description of the survey should include information on the
following points. Some of these will require fuller treatment in the more
detailed technical sections of the report.

(a) Statement of purposes of the survey. A general indication should be
given of the purposes of the survey and the ways in which it had been
expected that the results would be utilized.

(b) Description of the material covered. An exact description should be
given of the geographical region and the categories of material covered
by the survey. In a survey of a human population, for example, it is
necessary to specify whether such categories as hotel residents, institutions
(e.g. boarding houses, sanatoriums), vagrants, military personnel, were
included. The reporter should guard against any possible misapprehension
regarding the coverage of the survey.

(c) Nature of the information collected. This should be reported in
considerable detail, including a statement of items of information
collected but not reported on. The inclusion of copies of the schedules
and relevant parts of the instructions used in the survey (including
special rules for coding and classifying) is often of value. If this is
not practicable it may be possible to make available a limited number of
copies which may be obtained on request.


* The introduction to the memorandum is omitted. 
(d) Method of collecting the data. Whether by interviewers, investigators,
mail, etc. 


(e) Sampling method. An indication should be given in general terms of 
the type of sampling adopted, the size of the sample, the proportion 
it forms of the material covered, and arrangements for follow-ups, if 
any, in cases of non-response. 


(f) Accuracy. A general indication of the accuracy attained should be
given.


(g) Repetition. State whether the survey is an isolated one undertaken 
without intention of repetition, or is one of a series of similar surveys. 


(h) Point or period [of time]. Point or period of time to which the data 
refer. 


(i) Date and duration. The starting date and period taken for the field 
work. 


(j) Cost. An indication should be given of the cost of the survey, under 
such headings as preliminary work, field investigations, analysis, etc. 

(k) Responsibility. The name of the organization sponsoring the survey 
and of the one responsible for conducting it. 


(l) References. References should be given to any published reports or
papers. 


(2) Design of the survey 
The [sampling] design of the survey should be carefully specified.* 


(3) Method of selecting sample-units†


The reporter should describe the procedure used in selecting sample-units, 
and if it is not a random selection he should indicate the evidence on which 
he relies for adopting an alternative procedure. Purposive selection and quota 
sampling cannot be regarded as equivalent to random sampling. 


(4) Personnel and equipment 


It is desirable to give an account of the organization of the personnel 
employed in collecting, processing and tabulating the primary data, together 
with information regarding their previous training and experience. Arrangements
for training, inspection, supervision, and methods of processing data
should be explained, as also should methods of checking the accuracy both of 
the primary data and of the processing. A brief mention may be made of the 
equipment used in processing the data. 

* A section on terminology follows. 


† The first paragraph, defining random and systematic processes of selection, is
omitted.
The critical observations of the technicians in regard to any part of the
survey should be given. These observations will help others to improve their
operations.


(5) Costs 

An important reason for the use of sampling (instead of complete 
enumeration) is lower cost. Information on costs is therefore of great interest. 
Costs should be classified so far as possible under such heads as preparation 
(showing separately the cost of pilot studies), field work, supervision, processing, 
analysis, and overhead costs. In addition, labour costs in man-weeks of 
different grades of staff, and also time required for interview and journey 
time and transport costs between interviews, should be given. The compilation 
of such information, although often inconvenient, is usually worth undertaking 
as it may suggest substantial economies in the planning of future surveys. 
Efficient design demands a knowledge of the various components of cost, as
well as of the components of variance.


(6) Accuracy of the survey 
(a) Precision as indicated by the random sampling errors deducible from
the survey. Standard deviations of sample-units should be given in
addition to such standard errors (of means, totals, etc.) as are of interest.
The process of deducing these estimates of error should be made
entirely clear. This process will depend intimately on the design
of the sample survey. An analysis of the variances of the sampling-units
into such components as appear to be of interest for the planning of
future surveys is also of great value.

(b) Degree of agreement observed between independent investigators
covering the same material. Such comparison will be possible only
when interpenetrating samples have been used, or checks have been
imposed on part of the survey. It is only by these means that the
survey can provide an objective test of possible personal equations
(differential bias among the investigators).

(c) Other non-sampling errors. (i) Errors which are common to all
investigators, and indeed any constant component of error (or "bias")
in the recorded information, will not be included in the estimates of
the random sampling errors deducible from the survey results.
(ii) Another source of error of the same type is that due to observation
of quantities which do not correspond exactly to the quantities of which
estimates are required : in a crop-cutting survey, for example, the
yields of the sample plots give estimates of the quantity of produce
in the standing crop, whereas the final yield will be affected by losses
at harvest. (iii) The possible effects of such errors on the accuracy
of the results, and of incompleteness in the recorded information
(e.g. non-response, lack of records, whether covering the whole of the
survey or particular areas or categories of the material), should therefore
be fully discussed. (iv) Any special checks instituted to control and
determine the magnitude of these errors should be described, and the
results reported.


(d) Accuracy, completeness and adequacy of the frame. The accuracy of 
the frame can and should be checked and corrected automatically in 
the course of the enquiry, and such checks afford useful guidance for 
the future. Its completeness and adequacy cannot be judged by internal 
evidence alone. Thus complete omission of a geographic region or 
the complete or partial omission of any particular class of the material 
intended to be covered cannot be discovered by the enquiry itself, 
and auxiliary investigations have often to be made. These should be 
put on record, indicating the extent of inaccuracy which may be 
ascribable to such defects. 

(e) Comparison with other sources of information. Every reasonable
effort should be made to provide outside comparisons with other
sources of information. Such comparisons should be reported along
with the other results, and the significant differences should be discussed.
The object of this is not to throw light on the sampling error—since 
a well-designed survey provides adequate internal estimates of such 
errors—but rather to gain knowledge of biases, and other non-random 
errors. 

(f) Efficiency. The results of a survey often provide information which 
enables investigations to be made on the efficiency of the sampling 
designs, in relation to other sampling designs which might have been 
used in the survey. The results of any such investigations should be
reported. To be fully relevant the relative costs of the different 
sampling methods must be taken into account when assessing the 
relative efficiency of different designs and intensities of sampling. 
Such an investigation can be extended to consideration of the relation 
between the cost of carrying out surveys of different levels of accuracy 
and the losses resulting from errors in the estimates provided. This 
provides a basis for determining whether the survey was fully adequate 
for its purpose, or whether future surveys should be planned to give 
results of higher or lower accuracy. 


CHAPTER 6 


ESTIMATION OF THE POPULATION VALUES 


6.1 Possibility of alternative estimates 


In this chapter we shall deal with the derivation of estimates of the 
population values from the numerical results obtained in the sampling. A 
simple example of such an estimate is provided by the arithmetic mean of the 
sample values of a random sample. It is well known that this mean provides 
an estimate of the mean of the population from which the sample was drawn, 
though it will not, owing to sampling errors, be exactly equal to the mean 
of the population. 

The arithmetic mean of the sample values is not the only possible estimate 
of the population mean. We might, for instance, take the median, i.e. the 
central value, or the geometric mean, i.e. the antilogarithm of the mean of the 
logarithms of the sample values, or even the mean of the highest and lowest 
values in the sample. 
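
The four alternative estimates just mentioned can be computed side by side. The following Python sketch uses the standard `statistics` module; the five sample values are hypothetical and serve only to show that the estimates generally differ from one another.

```python
import statistics

# Hypothetical sample values, chosen only for illustration.
sample = [6.2, 8.0, 8.2, 11.0, 13.8]

arithmetic_mean = statistics.mean(sample)
median = statistics.median(sample)                   # the central value
geometric_mean = statistics.geometric_mean(sample)   # antilog of mean of logs
mid_range = (max(sample) + min(sample)) / 2          # mean of highest and lowest

print(arithmetic_mean, median, geometric_mean, mid_range)
```

For any set of positive values that are not all equal the geometric mean falls below the arithmetic mean, which is one reason the choice of estimate matters.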

In addition to estimates such as the mean and the median, which can be 
derived from a given set of values independently of any supplementary 
information associated with these values, there are further alternative estimates 
which can be derived by taking account of such supplementary information 
as is available, either qualitative or quantitative. Thus, as has been mentioned 
in Section 3.3, if the numbers of units from the whole population falling in 
the different strata of some stratification are known, a random sample can be 
adjusted so that the different strata are represented in their correct proportions. 
Similarly, supplementary information on a quantitative character can be used 
in various ways to provide estimates which will in general be more accurate 
than the simpler estimates which do not utilize this information.

In deciding which is the best estimate for any given type of sampling three 
different criteria have to be considered. These are, absence of bias, accuracy 
(or, as it is technically known, efficiency), and computational convenience. 
In the case of a random sample, if the population values are normally 
distributed—the meaning of this term is explained in Chapter 7—the
arithmetic mean will provide an estimate which is both free from bias and, 
apart from supplementary information, of maximum accuracy. It is also 
sufficiently simple computationally for practical use. More important, the
mean will remain an unbiased estimate of the population mean whatever the
form of the distribution of the population values, though it will not necessarily
be the most accurate estimate that could be devised. The mean also has the
incidental advantage that the sampling errors to which it is subject can be
relatively easily assessed, and are not greatly dependent on the form of the
distribution of the population values.

In this book we do not propose to do more than list what appear to be
the most useful estimates for any given type of sampling, and give examples
of the computations involved. Any discussion of questions of bias and relative
efficiency requires advanced mathematical statistical theory. In general the 
recommended estimates are free from any important source of bias ; where 
this is not the case the circumstances in which bias can arise are indicated. 

Estimates are required not only for the whole population, but frequently 
also for the different parts of it which constitute domains of study. The 
formulæ of estimation which are applicable to the whole population are in
general applicable also to the separate domains, and need not be discussed 
separately. In certain cases adjustments which can be applied to the population 
estimates cannot be applied to the estimates for the different domains. Thus 
if the population mean of a supplementary variate is known, but not the means 
for the different domains of study, adjustment by means of supplementary 
information can be applied only to the estimates of the whole population. In 
like manner the gain in accuracy due to stratification is sometimes different 
for the population and domain estimates. If the domains cut across the strata, 
for example, the errors of the domain estimates may only be slightly reduced 
by the stratification. 


6.2 Notation 


It is important, both in discussion of the problem of estimation and in
the mathematical formulæ, to make a clear distinction between estimates of
the population values and the population values themselves. The present 
convention in mathematical statistics is to denote the population parameters 
by Greek letters and the corresponding estimates by the corresponding Latin 
letters. This convention, however, is difficult to apply consistently, and is 
in any case more appropriate for infinite hypothetical populations than for 
the finite populations met with in sampling. In the present manual we have 
for the most part adopted the convention of denoting the population values 
by bold type, the corresponding estimates of these values by Gill Sans type, 
and values appertaining to the selected sampling units by ordinary italic type. 
Thus, with a quantitative character or variate $y$, the values for the selected
sampling units will be denoted by $y$ (with or without suffices as necessary),
the mean of these values will be denoted by $\bar{y}$ (following the ordinary convention),
the estimate of the mean of the population by $\bar{\mathsf{y}}$, and the true mean of the
population by $\bar{\mathbf{y}}$. With a random sample we shall have $\bar{\mathsf{y}} = \bar{y}$, but $\bar{\mathsf{y}}$ differs
from $\bar{\mathbf{y}}$ by the sampling error. Totals for the population are indicated by
capitals, summation over the sample values by $S$, and summation over the
different strata by $\Sigma$.

In certain types of estimation we shall be concerned with the use of 
supplementary information, such as size of unit, which is known not only for 
the selected sampling units but also for the whole of the population, or in 
the case of two-phase sampling, for the larger number of units selected at 
the first phase. A variate representing quantitative supplementary information 
will be denoted by x.

Even when information on a variate is not available for the whole population,
it may be necessary to make an estimate of the population values for some 
standardized value of this variate. The letter x will also be used to denote 
a variate of this type. 


The following is a list of the principal symbols employed in this chapter :—


$b$, estimated regression coefficient.
$f$, working sampling fraction.
$\mathbf{f}$, exact sampling fraction.
$g$, working raising factor ($= 1/f$).
$\mathbf{g}$, exact raising factor ($= 1/\mathbf{f}$).
$i$ (suffix), denotes values belonging to a particular stratum $i$.
$n$, number of units in the sample.
$\mathbf{N}$, $\mathsf{N}$, number of units in the population, and its estimate.
$\mathbf{p}$, $\mathsf{p}$, proportion of units in the population possessing a given
attribute, and its estimate.
$r$, ratio $y/x$.
$\mathbf{r}$, $\mathsf{r}$, ratio $\mathbf{Y}/\mathbf{X}$, and its estimate.
$S$, summation over the units of the sample.
$S_i$, summation within stratum $i$.
$\Sigma$, summation over the strata.
$u$, number of units in the sample possessing a given attribute.
$\mathbf{U}$, $\mathsf{U}$, number of units in the population possessing a given attribute,
and its estimate.
$x$, supplementary quantitative variate, such as size of unit.
$\mathbf{X}$, $\mathsf{X}$, total of $x$ for the population, and its estimate.
$y$, quantitative variate under investigation.
$\mathbf{Y}$, $\mathsf{Y}$, total of $y$ for the population, and its estimate.
$\bar{r}$, $\bar{x}$, $\bar{y}$, means of $r$, $x$, $y$, for the sample.
$\bar{\mathbf{x}}$, $\bar{\mathbf{y}}$, $\bar{\mathsf{x}}$, $\bar{\mathsf{y}}$, means of $x$ and $y$ for the population, and their estimates.


6.3 General rules 


There are certain fundamental rules of estimation which apply to all types
of sampling. These are :—


Rule 1—The population total of a quantitative variate

To estimate totals for the population multiply all sample values by
their raising factors (equal to the reciprocals of the sampling fractions)
and sum the raised results.


Rule 2—Number of units in the population 
To estimate the number of sampling units in the population follow 
Rule 1, scoring each selected sampling unit as 1. 




Rule 3—The population mean of a quantitative variate 


Divide the estimated total of the variate for the population by the 
estimated number of units in the population. 


Rule 4—Proportion (or percentage) of units possessing a given qualitative character 
Proceed as for a quantitative variate, scoring all units possessing the 
given character as 1 and all others as 0. Divide the estimated total score 
by the estimated number of units in the population. 


Rule 5—Ratio of two quantitative variates 


Estimate the totals of the two quantitative variates for the population 
by Rule 1 and take the ratio of these totals. (Rules 3 and 4 are special 
cases of Rule 5.) 


In cases in which the probability of selection of all units is the same 
(uniform sampling fraction or, in the case of multi-stage sampling, uniform 
overall sampling fraction), the first four rules can be condensed into the simple 
general rule that means and proportions in the population are estimated by 
the corresponding means and proportions in the sample, and totals and 
numbers in the population are estimated by multiplying the corresponding 
totals and numbers by the common raising factor. 

The above rules cover most of the methods of estimation discussed in this 
manual except those involving regression, which cannot easily be summarised 
in simple rules. They give rise to the formulæ of estimation set out in the
following sections of this chapter. 
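
Rules 1 to 5 can be sketched as a handful of small Python functions. The function names and the four-unit sample below are hypothetical, introduced only to illustrate the rules; each unit carries its own raising factor, so that the sketch also covers samples in which the sampling fraction is not uniform.

```python
def estimate_total(values, raising_factors):
    """Rule 1: multiply each sample value by its raising factor and sum."""
    return sum(g * y for g, y in zip(raising_factors, values))


def estimate_number(raising_factors):
    """Rule 2: follow Rule 1, scoring each selected unit as 1."""
    return estimate_total([1] * len(raising_factors), raising_factors)


def estimate_mean(values, raising_factors):
    """Rule 3: estimated total divided by estimated number of units."""
    return estimate_total(values, raising_factors) / estimate_number(raising_factors)


def estimate_proportion(has_attribute, raising_factors):
    """Rule 4: score units with the attribute as 1, all others as 0."""
    scores = [1 if flag else 0 for flag in has_attribute]
    return estimate_mean(scores, raising_factors)


def estimate_ratio(y_values, x_values, raising_factors):
    """Rule 5: ratio of the two estimated totals."""
    return (estimate_total(y_values, raising_factors)
            / estimate_total(x_values, raising_factors))


# Hypothetical sample: two units taken at 1 in 10 and two at 1 in 20.
y = [10, 20, 30, 40]
x = [5, 10, 10, 20]
attribute = [True, False, True, True]
g = [10, 10, 20, 20]

print(estimate_total(y, g))               # estimated total of y
print(estimate_number(g))                 # estimated number of units
print(estimate_mean(y, g))                # estimated mean of y
print(estimate_proportion(attribute, g))  # estimated proportion
print(estimate_ratio(y, x, g))            # estimated ratio of totals
```

When every raising factor is the same the factor cancels in Rules 3, 4 and 5, and the estimates reduce to the ordinary sample mean, proportion and ratio, as stated in the condensed rule above.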


6.4 Random sample 
Number :

$$\mathsf{N} = g n$$

$\mathsf{N}$ will be equal to $\mathbf{N}$ except for minor discrepancies due to the use of a
working sampling fraction which does not give an integral number of sampling
units. If $\mathbf{N}$ is known then the true sampling fraction $\mathbf{f}$ equals $n/\mathbf{N}$ and the
true raising factor $\mathbf{g}$ equals $\mathbf{N}/n$.

Mean of a quantitative variate :

$$\bar{\mathsf{y}} = \bar{y} = \frac{1}{n} S(y) \qquad (6.4.a)$$

Total of a quantitative variate :

$$\mathsf{Y} = g \, S(y) = \mathsf{N} \bar{y}$$

or more accurately, if $\mathbf{N}$ is known, and differs from $\mathsf{N}$,

$$\mathsf{Y}' = \mathbf{g} \, S(y) = \mathbf{N} \bar{y} \qquad (6.4.b)$$

Proportion possessing a given attribute :

$$\mathsf{p} = \frac{u}{n}$$

Number possessing a given attribute :

$$\mathsf{U} = g u = \mathsf{N} \mathsf{p}$$

or, more accurately,

$$\mathsf{U}' = \mathbf{g} u = \mathbf{N} \mathsf{p} \qquad (6.4.c)$$

The same formulæ of estimation will hold for systematic samples from
lists, etc.


Example 6.4.a 


In a housing survey of a town a systematic sample from a list of all houses 
was taken with a sampling fraction of 1/50. 627 houses out of a total of 8491 
in the sample were classified as defective. What is the estimated number and 
percentage of defective houses in the town ? 

Percentage defective = 100 p = 100 × 627/8491 = 7.38 per cent.


Total number defective = U = 50 × 627 = 31,350
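
The same arithmetic can be written out in Python as a check. The variable names are introduced here for illustration; the figures are those of the example.

```python
raising_factor = 50   # reciprocal of the sampling fraction 1/50
n = 8491              # houses in the sample
u = 627               # houses in the sample classified as defective

p = u / n                     # estimated proportion defective
percentage = 100 * p          # estimated percentage defective
U = raising_factor * u        # estimated number of defective houses
N = raising_factor * n        # estimated number of houses in the town

print(round(percentage, 2), U, N)
```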


Example 6.4.b


If the values in Table 6.4 are taken to represent measurements on a random
sample of 20 objects, selected from a batch of such objects with a sampling
fraction of 1/25, estimate the mean measurement of the batch, and the total
of all the measurements of the batch.


TABLE 6.4—SAMPLE OF 20 MEASUREMENTS


 6.2    8.0    8.2   11.0
13.8   12.0    8.7   10.3
 8.0   10.7    8.5   14.6
 7.6    9.1   10.1    8.0
10.3   10.4    9.3    9.0


N = 25 × 20 = 500
S(y) = 193.8
ȳ = (1/20) × 193.8 = 9.69
Y = 25 × 193.8 = 4845


If the number in the batch is known to be 507, a slightly more accurate
estimate of the total is
Y′ = 507 × 9.69 = 4913
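
As a check on the working, the computation can be reproduced in Python. The measurements are those of Table 6.4; the variable names are ours.

```python
# The 20 measurements of Table 6.4.
measurements = [
    6.2, 8.0, 8.2, 11.0,
    13.8, 12.0, 8.7, 10.3,
    8.0, 10.7, 8.5, 14.6,
    7.6, 9.1, 10.1, 8.0,
    10.3, 10.4, 9.3, 9.0,
]
raising_factor = 25                 # reciprocal of the sampling fraction 1/25

n = len(measurements)
N_est = raising_factor * n          # estimated number in the batch
S_y = sum(measurements)             # S(y)
mean_est = S_y / n                  # estimated mean measurement
Y_est = raising_factor * S_y        # estimated total from the sample alone
Y_known_N = 507 * mean_est          # total when the batch size is known

print(N_est, round(S_y, 1), round(mean_est, 2), round(Y_est), round(Y_known_N))
```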
6.5 Stratified sample with uniform sampling fraction


The formulæ for a random sample hold, except that if the numbers in the
different strata $\mathbf{N}_i$ are known, and differ from $\mathsf{N}_i$, the formula 6.4.b is
replaced by

$$\mathsf{Y}' = \Sigma(\mathbf{N}_i \bar{y}_i) \qquad (6.5.a)$$

and the formula 6.4.c by

$$\mathsf{U}' = \Sigma(\mathbf{N}_i \mathsf{p}_i) \qquad (6.5.b)$$

with corresponding slight increases in accuracy in $\bar{\mathsf{y}}$ and $\mathsf{p}$, if they are derived
from these estimates by division by $\mathbf{N}$.


Example 6.5 


Table 6.5.a shows the wheat acreages of the stratified random sample of 
1 in 20 Hertfordshire farms described in Section 3.7. Estimate the total 
wheat acreage of the county and the mean acreage of wheat per farm (a) from 
the data of the sample alone, (b) given the total number of farms in each 
size-group. Estimate also the number of farms growing wheat. 


TABLE 6.5.a—HERTFORDSHIRE FARMS, 1939 : ACREAGES OF WHEAT IN A
STRATIFIED RANDOM SAMPLE OF 1 IN 20 FARMS (STRATIFIED BY ACREAGES
OF CROPS AND GRASS)


Size-group          3            4             5           6
Acres             21-50       51-150       151-300       301-
No. of farms        18           26            20          13

                 ..   ..      49   19       20   66        72
                  0    5      10   14       24   18        92
                  0    0      27    4       30   17        69
                  0    0      33    0       ..   32        78
                  0    0       4   12       17   71        51
                  0    9      80   18       ..   48        84
                  0    0       0    0       80   70         0
                  8    0       0   16        0   62       102
                  5    0      13   28       36   ..        13
                               0    5        0    0        92
                              27   23                     158
                              10   22                      62
                              24    3                       0

TOTAL               35          386           710         873


Size-group 1 (1-5 acres): 22 farms, no wheat.


Size-group 2 (6-20 acres): 26 farms, 7 acres of wheat on 1 farm.
The results are summarized in Table 6.5.b. The estimate of the total
area of wheat in the county from the sample is


Y = 20 × 2011 = 40,220 acres


The mean area of wheat per farm is
ȳ = 2011/125 = 16.1 acres


The number of farms growing wheat is
U = 20 × 54 = 1080


TABLE 6.5.b—SUMMARY OF SAMPLE OF TABLE 6.5.a


                          Farms with wheat     Wheat acreage
Size-group,   No. of      in sample            in sample        No. of      Total
acres         farms                                             farms       wheat
              in sample   No.   Proportion     Total    Mean    in county   acreage

1-5              22         0      .000            0     0.0       435           0
6-20             26         1      .038            7     0.3       519         160
21-50            18         5      .278           35     1.9       357         680
51-150           26        21      .808          386    14.8       519       7,680
151-300          20        16      .800          710    35.5       400      14,200
301-             13        11      .846          873    67.2       266      17,880

ALL             125        54      .432        2,011    16.1     2,496      40,600


If the total number of farms in each size-group is known, the estimate of
Y can be calculated with slightly more accuracy by using the size-group means,
as shown in the last three columns. This gives an estimate Y′ of the total
area of wheat of 40,600 acres, and a mean area per farm ȳ′ of 40,600/2496 = 16.3
acres. The gain in accuracy is here quite trivial, since the variation within
each stratum is large relative to the mean of that stratum.

The number of farms growing wheat can be estimated similarly from the
proportions in the size-groups giving


U′ = 0 × 435 + 0.038 × 519 + ... = 1083


Again the gain in accuracy is trivial.
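
The two sets of estimates in this example can be reproduced from the summary figures of Table 6.5.b. The Python sketch below is illustrative only: the tuple layout and variable names are ours, and the size-group means and proportions are carried unrounded, so that the total comes out a couple of acres away from the printed 40,600, which was built up from rounded means.

```python
# (farms in sample, farms with wheat, wheat acreage in sample, farms in county)
# for each size-group, transcribed from Table 6.5.b.
strata = [
    (22, 0, 0, 435),
    (26, 1, 7, 519),
    (18, 5, 35, 357),
    (26, 21, 386, 519),
    (20, 16, 710, 400),
    (13, 11, 873, 266),
]
raising_factor = 20  # sample of 1 in 20 farms

# (a) From the data of the sample alone.
n = sum(s[0] for s in strata)
Y = raising_factor * sum(s[2] for s in strata)    # total wheat acreage
y_bar = sum(s[2] for s in strata) / n             # mean acreage per farm
U = raising_factor * sum(s[1] for s in strata)    # farms growing wheat

# (b) Given the number of farms in each size-group (formulae 6.5.a, 6.5.b).
N_known = sum(s[3] for s in strata)
Y_dash = sum(N_i * total / n_i for n_i, _, total, N_i in strata)
U_dash = sum(N_i * u_i / n_i for n_i, u_i, _, N_i in strata)
y_bar_dash = Y_dash / N_known

print(Y, round(y_bar, 1), U)
print(round(Y_dash), round(y_bar_dash, 1), round(U_dash))
```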
6.6 Random sample, stratified after selection


The means of, or proportions in, the different strata must be calculated 
separately, and formulæ 6.5.a and 6.5.b used, with division by N for estimates
of ȳ and p.


Example 6.6 

Table 6.6.a shows the data, including acreages of crops and grass, for 
the random sample of 1 in 20 Hertfordshire farms described in Section 3.7. 
Estimate the total area of wheat and the number of farms growing wheat 
(a) directly from the sample, (b) by stratification by size, given the total numbers 
of farms in the size-groups of Table 6.5.b. 


TABLE 6.6.a—HERTFORDSHIRE FARMS, 1939: ACREAGES OF CROPS AND GRASS
(1ST COLUMN), AND OF WHEAT (2ND COLUMN), OF A RANDOM SAMPLE OF
1 IN 20 FARMS (CLASSIFIED BY DISTRICTS AFTER SELECTION)


District 1 District 3 District 4 | District 5 District 6 
15 farms 40 farms 24 farms 4 farms 24 farms 
188 16 | 370 67 40 0 11 0 4 0 8 0 
60 ‘OF i261 0 28 0| 6 0| 312 102 87 l4 
192 0 1:369. 58| 221 59 543 80 8 0 6 0 
48 0 | 212 45 31 0 822 265) 1l 0 44 0 
44 0 | 153 20 6 0 654 112 4 0 
79 33 | 287 44 34 0 3 0 335 102 614 2 
14 0 28 0 316 75 158 50 | 192 20 
465 92 l4 0 116 33 4 0 10 0 
197 0 4 0 4 0 68 27 District 7 24 0 
163 0 17 0 409 102 55 12 2 0 
198 0 2 0 6 0 4 0 10 farms 9 0 
78 0 3 0 115 0 2 0 3 0 
6 0 7 0 19 0 192 24 128 5 2 0 
35 0 6 0 274 6 4 0 4 0 120 24 
168 0 | 335 82 3 0 491 24 46 0 58 0 
— 4 0 144 0 224 28 181 20 20 0 
1,935 141 1 0 3 0 | 280 75 17 0 30 0 
4 0 482 62 90 0 24 0 197 6 
180 0 | 156 28 3 0 10 0 l4 3 
District 2 120 11| 302 71 3 0 36 0 32 6 
— 6 l 12 0 2 0 
8 farms 4,851 763 4 0 89 0 285 29 
161 80 — 138 0 
8 0 246 60 547 25 126 0 
294 29 == — 
597 107 4,034 837 2027 174 
8 0 ee ee 
2 a GRAND TOTAL, 
200 5 125 farms :—| 15,114 
14 0 2,301 
262 58 | 
1,385 259 


TABLE 6.6.b—HERTFORDSHIRE FARMS, 1939: ESTIMATION OF WHEAT ACREAGE
FROM THE RANDOM SAMPLE OF 1 IN 20 FARMS (TABLE 6.6.a) STRATIFIED BY
SIZE-GROUPS AFTER SELECTION


                         Farms with wheat     Acreage of wheat     No. of      Total
Size-group,   No. in                                               farms       for
acres         sample     No.   Proportion     Total     Mean      in county   county

1-5             25         0      0               0      0           435           0
6-20            26         1      .038            3      0.1         519          50
21-50           16         1      .062            6      0.4         357         140
51-150          17         8      .471          159      9.4         519       4,880
151-300         26        20      .769          762     29.3         400      11,720
301-            15        15     1.000        1,371     91.4         266      24,310

ALL            125        45      .360        2,301     18.4       2,496      41,100


(a) Total area of wheat = 2301 × 20 = 46,020 acres.

Number of farms growing wheat = 45 × 20 = 900.

(b) Classifying the data by size-groups (crops and grass) the numbers and
totals shown in Table 6.6.b are obtained. The mean wheat acreage
is then calculated for each size-group, multiplied by the total number
of farms in that size-group, and the products summed, giving an
estimated total wheat acreage of 41,100. Similarly, using the proportion
of farms with wheat instead of the mean acreage for each size-group,
the estimated number of farms growing wheat is found to be


.038 × 519 + .062 × 357 + ... = 860.


6.7 Stratified sample (variable sampling fraction) 


$N = \Sigma (g_i n_i)$

$Y = \Sigma \{g_i S_i(y)\}$

$\bar{y} = Y/N$

$U = \Sigma (g_i t_i)$

$p = U/N$


If the $N_i$ are known, the alternative formulæ 6.5.a and 6.5.b, with division
by N for $\bar{y}$ and p, are slightly more accurate.


Example 6.7


Table 6.7.a shows the Hertfordshire farm data for the stratified systematic 
sample with a variable sampling fraction described in Section 3.7. Estimate 
the total wheat acreage and the number of farms growing wheat. 


TABLE 6.7.a—HERTFORDSHIRE FARMS, 1939: STRATIFIED SYSTEMATIC SAMPLE 
OF WHEAT ACREAGES, WITH A VARIABLE SAMPLING FRACTION (CLASSIFIED 
BY DISTRICTS) 


Size-group: | 1-5 | 6-20 |21-50 | 51-150 | 151-300 301-500 501- 
Sampling 
fraction : | Nil | 1/200 | 1/60 1/20 1/10 1/5 1/3 
No. in 
sample: | 0 3 6 26 40 43 17 
District 
1 -| — 0 0 30/17 18 0ļ|172 Oo 92|114 
| 6 28 56 | 
2 — 0 10 O 0|30 16 50| 50 49 121/119 
40 55 62 os 72 100 | 107 | 
| 186 124 105 | 101 
| 104 160 
3 —|— O25) BI (0 10) nA 67 22 Psion 
0 |10 O} O 41 42] 58 75 51/120 
0 0 24 25| 78 94 126 
61 86 97 
4 — 0 17 | 28 24|42 o 24| 88 65 58|268 260 
8 0 | 54 60 75| 94 115 98) 265 260 
5 44 6 |121 80 92|112 155 
18 40 120 |240 168 
209 
5 -| —= — 0 0 31 32 | 66 142 26) — 
| 0 | 
6 — 0 0 OF 0/88 0 10 0 72 
0 0| 66 17 29 0 
19 
7 = = — 0 0} 0 60 16 — 
l4 
| 
TOTAL | — 0 27 214 1,163 3,292 2,925 
| 


The calculations are shown in Table 6.7.b. They follow the same lines 
as before, except that the sample total for each stratum must be raised 
separately. Using the working sampling fractions we obtain estimates of 
42,765 acres of wheat and 911 farms growing wheat.

TABLE 6.7.b—SUMMARY OF SAMPLE OF TABLE 6.7.a


Size-group,   No. in    No. with    Total      Raising      Raised totals      Mean acreage
acres         sample    wheat       acreage    factor       No.     Acreage    per farm

1-5              0         —           —          —           —         —          —
6-20             3         0           0        200           0         0          0
21-50            6         2          27         60         120     1,620        4.5
51-150          26        12         214         20         240     4,280        8.2
151-300         40        30       1,163         10         300    11,630       29.1
301-500         43        40       3,292          5         200    16,460       76.6
501-            17        17       2,925          3          51     8,775      172.1

TOTAL                                                       911    42,765
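
The raising of each stratum by its own factor can be sketched in Python as follows. The figures are the sample counts and totals of Tables 6.7.a and 6.7.b; the name `raised_total` is illustrative, not part of the original text.

```python
# One tuple per sampled size-group, from Tables 6.7.a and 6.7.b:
# (raising factor g_i, farms in sample, farms with wheat, wheat acreage in sample)
# The 1-5 acre group was not sampled and so contributes nothing.
STRATA = [
    (200, 3, 0, 0),
    (60, 6, 2, 27),
    (20, 26, 12, 214),
    (10, 40, 30, 1163),
    (5, 43, 40, 3292),
    (3, 17, 17, 2925),
]


def raised_total(strata, column):
    """Sum over strata of raising factor times the sample total in `column`."""
    return sum(row[0] * row[column] for row in strata)


total_acres = raised_total(STRATA, 3)       # estimated wheat acreage
farms_with_wheat = raised_total(STRATA, 2)  # estimated farms growing wheat
farms_covered = raised_total(STRATA, 1)     # estimated farms in sampled groups
```

Because the raising factors differ between strata, the sample totals cannot be pooled before raising, which is why each stratum appears as its own row.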


6.8 Use of supplementary information in estimation 


As already indicated, supplementary information on a quantitative character,
the values of which are known for all the units of the population, can be used
as the basis of stratification, or for the adjustment of an unstratified sample
by stratification after selection. Alternatively, as mentioned in Section 2.8,
such information can be used directly without stratification. Two methods,
the ratio method and the regression method, are available. In either case
only the total or mean of the supplementary variate for the whole population
need be known (in addition to the values for the selected sampling units).
The ratio method is simpler computationally, but the regression method is
in certain circumstances more accurate.

In the ratio method, the ratio Y/X in the population is estimated from
the sample, the estimated ratio being multiplied by the total X of x for the
population to give the estimated total Y of y for the population. The method
of estimating the ratio must be such that bias is avoided. As already explained in
Section 2.6, the appropriate estimate of the ratio for a random sample is
$S(y)/S(x)$ or $\bar{y}/\bar{x}$. More generally, Rule 5 of Section 6.3 will give an
unbiased estimate, though in certain cases separate values of the ratio may
be estimated for the different strata as described in Sections 6.10 and 6.11.*


In the regression method the average change of y for unit change of x
(known as the regression coefficient) is estimated, and this coefficient is used to
adjust the sample results for any discrepancy between the mean size of unit
in the sample and in the population.

* An extension of the ratio method, the double ratio estimate, is described in Section
10.5.


SECT. 6.8 SAMPLING METHODS FOR CENSUSES AND SURVEYS 


The contrast between the ratio and regression methods is illustrated in
Fig. 6.8. The data plotted are those of Table 6.12. The dots represent
the x and y values of the sample points, the sample mean $(\bar{x}, \bar{y})$ being M.
Q′ represents the known population mean $\bar{X}$ of the supplementary variate,
which differs from $\bar{x}$ by QQ′. The line OMD through the origin and the
mean represents the ratio $\bar{y}/\bar{x}$ given by the sample, and the ordinate P₁Q′
of the point P₁ on this line, equal to $(\bar{y}/\bar{x})\bar{X}$, gives the adjusted estimate of
the population mean by the ratio method. The regression line AMB also
passes through the mean, and has a slope b equal to the regression coefficient.


[Figure 6.8: scatter diagram of the sample points, both axes running from 0 to 300, with Q and Q′ marked on the horizontal axis. Horizontal axis: volumes, x, of corresponding stands (cu. ft. per 1/10 acre).]


FIG. 6.8—USE OF SUPPLEMENTARY INFORMATION: RATIO AND REGRESSION METHODS
(DATA OF TABLE 6.12)


This line has the property that the sum of the squares of the vertical distances
of the sample points from it is a minimum. The adjusted estimate by the
regression method is given by the ordinate P₂Q′ of the point P₂, and equals
$\bar{y} + b(\bar{X} - \bar{x})$.

The regression method therefore differs from the ratio method in that in
the former the straight line which best fits the sample values is taken, whereas
in the latter the line through the origin is taken. When the supplementary
variate x represents size of unit, the true regression line generally passes
through the origin, though curvature of this line may result in the best-fitting
straight regression line not passing through, or even very close to, the origin.
Nevertheless, in most census work in which x represents size, or some variate
closely correlated with size, the greater simplicity of the ratio method outweighs
any small gain in accuracy resulting from the use of regression.

It may be noted that in large samples the regression line can be plotted
by grouping the data according to the x values and plotting the means of y
for the different groups.

The formulæ for regression have been included in this book, not because 
it is expected they will be very commonly used in census work, but because 
the regression method represents an important part of sampling procedure, 
without which no account of sampling methods would be complete, and 
because the calculation of the sampling errors to which a balanced sample 
is subject can only be made by use of the regression concept. 

If the population mean of x is estimated from observations at the first
phase of two-phase sampling, the observations for y being obtained at the
second phase, the same formulæ of estimation hold, the estimate $\bar{x}_1$ being
substituted for $\bar{X}$.* If, however, the sampling is single-phase, and the estimate $\bar{x}$
from the sample is substituted for $\bar{X}$, the formula appropriate to a
random sample without supplementary information is obtained. In other
words, there is no gain in accuracy in the population estimate unless $\bar{X}$ is
known or alternatively is estimated from a larger sample than is available for y.

In addition to their use in the adjustment of the population estimate of the
mean or total of y when the mean or total of x is known for the population,
or is determined from the first phase of two-phase sampling, ratios and
regressions are of use for the purpose of obtaining estimates of the means of
y for some standardized value of x. Hence comparisons of different parts of
the population can be made, freed from the effects of variation in the average
values of x. In the case of ratios this is equivalent to comparing the values
of such quantities as number of sheep per 100 acres instead of number of sheep
per farm. Regression enables similar standardization to be made where simple
proportionality is not appropriate; in a nutrition survey, for example, it may
be found that the incidence of malnutrition varies with size of family, but
the relation will not be proportionate. In large-scale surveys, however,
standardization of this type can equally well be made by using the size-group
means, thus avoiding the trouble of calculating regressions.

The formulæ in the following sections are given for a quantitative variate.
Formulæ for the proportion of units possessing a particular attribute can be obtained
by scoring each unit as 1 if it has the attribute and 0 otherwise. When the
supplementary variate x represents size of unit, proportions weighted by
unit size may be required, in which case a unit of size x will be scored with x if it
has the attribute, and zero otherwise.


* Note that the observations on x for the second-phase units form part
of the first-phase information, and must be included
when calculating this estimate.




Example 6.8 

The data of Table 6.8 are extracted from the Report of the National Farm 
Survey of England and Wales (Ministry of Agriculture and Fisheries, 1946, G). 
They give the rents of holdings per acre of crops and grass, classified by size- 
groups, in Berkshire and Cornwall. Calculate rents per acre standardized for 
size of holding, in the proportions in which the different size-groups occur 
in the whole country. 


TABLE 6.8—RENTS PER ACRE FOR BERKSHIRE AND CORNWALL


                Rent per acre (shillings)     Proportionate areas of
Size-group                                    different size-groups
(acres)         Berkshire      Cornwall       in whole country

5-25                53            55                   1
25-100              32            31                   6
100-300             24            22                  10
300-700             20            18                   4
700-                17            14                   1

Overall             23            28                  22


The proportionate areas for the whole country are shown in the last column. 
The standardized rent for Berkshire, using these areas as weights, is 
(1 x 53+ 6 x 32+ .. .)/22 = 26 shillings per acre, and that for Cornwall 
is 25 shillings per acre. 
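
The standardization amounts to a weighted mean of the size-group rents, which can be sketched in Python as below. The weights are taken as 1, 6, 10, 4, 1, summing to the divisor 22 used in the text; the name `standardized_rent` is illustrative only.

```python
# Proportionate areas of the five size-groups in the whole country (weights).
WEIGHTS = [1, 6, 10, 4, 1]
# Rents per acre (shillings) by size-group, smallest holdings first.
BERKSHIRE = [53, 32, 24, 20, 17]
CORNWALL = [55, 31, 22, 18, 14]


def standardized_rent(rents, weights):
    """Weighted mean of the size-group rents, using common weights."""
    return sum(w * r for w, r in zip(weights, rents)) / sum(weights)


berkshire_std = standardized_rent(BERKSHIRE, WEIGHTS)  # 582/22
cornwall_std = standardized_rent(CORNWALL, WEIGHTS)    # 547/22
```

Using the same weights for both counties is what removes the effect of their different size-distributions from the comparison.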

Inspection of the table shows that although the overall rent per acre is 
considerably less for Berkshire than for Cornwall, there is little difference 
between the two counties for the different size-groups, the rents for Berkshire 
being in general somewhat greater than those for Cornwall. This is brought 
out, in a single contrast, by the standardized rents. The lower overall rent 
per acre for Berkshire is in a certain sense accounted for by the greater 
proportion of large farms in that county. 

This example illustrates both the use and the danger of standardization of 
this type. The standardized rents eliminate the effects of differences in average 
size on the average rent, which in so far as they are due to greater concentration 
of buildings on the smaller holdings, greater demand for smaller holdings, 
etc., do not represent differences in value of the land. It would be incorrect, 

however, to assume that were the size-distributions of farms in the two counties 
the same the overall rents per acre would be the same. Part of the difference
is due to the tendency of poorer land to be farmed in larger units, and such
land would not command the full increase in rent which is apparently attracted 
by smaller farms if it were divided into smaller units. 


6.9 Ratio method: random sample 


$r = \frac{\bar{y}}{\bar{x}} = \frac{S(y)}{S(x)}$

$\bar{Y} = r\bar{X}$    (6.9)

$Y = rX = \frac{S(y)}{S(x)} X$


The formula for $\bar{Y}$ may be used for obtaining the “standardized” value
$y_0$ of y for a standard value $x_0$ of x, or for estimation in two-phase sampling
using an estimate $\bar{x}_1$ obtained from first-phase information.


Example 6.9.a

Estimate the total area of wheat from the data of Example 6.6 by the
ratio method, given that the total area of crops and grass in the county is
273,074 acres.

Crops and grass in sample = S(x) = 15,114 acres
Wheat in sample = S(y) = 2301 acres

Estimate of wheat acreage in county = (2301/15,114) × 273,074
= 0.15224 × 273,074
= 41,570 acres.
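
The ratio estimate of Example 6.9.a can be sketched in Python as follows. The function name `ratio_estimate` is illustrative; the figures are those quoted in the example.

```python
def ratio_estimate(sample_y_total, sample_x_total, population_x_total):
    """Return (r, Y): the sample ratio S(y)/S(x) and the estimated total r * X."""
    r = sample_y_total / sample_x_total
    return r, r * population_x_total


# Wheat (y) and crops and grass (x) in the 1 in 20 sample; X known for the county.
r, wheat_acres = ratio_estimate(2301, 15114, 273074)

# For contrast, the estimate that ignores x simply raises the sample total.
direct_acres = 20 * 2301
```

The ratio estimate falls well below the direct estimate of 46,020 acres because the sample happens to contain more than its share of crops and grass.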


Example 6.9.b

Table 6.9 gives the numbers of persons belonging to 43 kraals which form
a random sample of the 325 kraals in the Mondora Reserve in Southern
Rhodesia, and also the numbers of persons absent from these kraals. (The
data form part of the results of a sample census of the Hartley District, and
have kindly been made available by Dr. J. R. H. Shaul.) Estimate the
percentage of persons absent from the reserve, and the numbers of persons
belonging to the reserve and absent from the reserve.


TABLE 6.9—DATA FROM A RANDOM SAMPLE OF 43 KRAALS: TOTAL NUMBER
OF PERSONS (INCLUDING ABSENTEES), x, AND NUMBER OF ABSENTEES, y


x y x y s y 

T. B 89 7 75 19 159 38 
3 M 57 2 69 16 5i 26 
30 6 132 26 63 9 69 wy gd 
45 3 47 i 83 14 61 fa 
28 5 gh lit 124 2% 164 69 
ua i5 iio à 2% 31 3 2 i 
125 18 6 l6 OB. oa a 5 
81 9 103 18 42 A 38 a0 
43 12 52 16 8538 3 1 
53 4 67.7 ae 28 5l 9 
14881 eee is 3,427 799 

Percentage absent = 100 r = (799/3427) × 100 = 23.3 per cent.

Number belonging to reserve = X = (325/43) × 3427 = 25,902.

Number absent from reserve = Y = (325/43) × 799 = 6039.

Number present in reserve = 25,902 − 6039 = 19,863.

This example, though superficially similar to Example 6.4.a, is structurally
different, in that the sampling units consist of kraals and not individuals.
This affects the estimation of the sampling error (see Section 7.8).
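
A minimal Python sketch of the kraal calculations is given below. The sample totals 3427 and 799 and the counts of 325 and 43 kraals are those of the example; the helper name `raise_by_units` is illustrative only.

```python
def raise_by_units(sample_total, units_in_population, units_in_sample):
    """Raise a sample total in proportion to the number of sampling units."""
    return sample_total * units_in_population / units_in_sample


persons_in_sample, absent_in_sample = 3427, 799
kraals_in_reserve, kraals_in_sample = 325, 43

# The percentage is a ratio of two sample totals, not a proportion of kraals.
percent_absent = 100 * absent_in_sample / persons_in_sample
belonging = raise_by_units(persons_in_sample, kraals_in_reserve, kraals_in_sample)
absentees = raise_by_units(absent_in_sample, kraals_in_reserve, kraals_in_sample)
present = belonging - absentees
```

Both totals are raised by the same factor 325/43 because the kraal, not the person, is the unit of sampling.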


6.10 Ratio method : stratified sample with uniform sampling fraction 


(a) When the ratio is assumed to be the same for all strata:
The formulæ for a random sample hold.


(b) When the ratio is permitted to assume different values for the different
strata:


Treat each stratum separately, using the formulæ for a random sample,
and build up the population estimates by summation of the estimates for
the separate strata, with division by N or N for the population means.


This gives


$Y = \Sigma \left\{ \frac{S_i(y)}{S_i(x)} X_i \right\}$    (6.10)

etc.
The choice between method (a) and method (b) depends on: 


(1) Numbers in the different strata—method (b) can only be used if the
numbers of units from the individual strata are sufficiently large to
give reasonably accurate determinations of the values of the ratio for the
separate strata: if the numbers are small and there is correlation
between r and x, the method will be biased. (This objection does not
hold if selection with probabilities proportional to x is used—see
Section 3.10.)


(2) The degree of variation in the ratio between the different strata—the
greater this variation the greater will be the gain in accuracy by the
use of method (b); if the variation is small, method (a) may be the
more accurate.

(3) Computational convenience—method (a) is simpler, since only one
value of the ratio is involved.


Method (b) can also be used for a random sample stratified after selection, 
provided the population totals of x for each stratum are known. 


Example 6.10

Estimate the total area of wheat from the data of Example 6.6 by the
ratio method, stratifying the data by districts, and using separate values of
the ratio for the different districts.

TABLE 6.10—HERTFORDSHIRE FARMS, 1939: ESTIMATION OF WHEAT ACREAGES
FROM THE RANDOM SAMPLE OF 1 IN 20 FARMS BY THE RATIO METHOD AFTER
STRATIFICATION INTO DISTRICTS


                        Sample                                District     Estimated
District    No. of     Wheat,     Crops and     Ratio,        crops and    district
No.         farms      S_i(y)     grass,        r_i           grass,       wheat
                                  S_i(x)                      X_i

1             15         141        1,935       .0729          22,932        1,670
2              8         259        1,385       .1870          43,591        8,150
3             40         763        4,851       .1573          57,263        9,010
4             24         837        4,034       .2075          73,946       15,340
5 & 7         14         127          882       .1440          40,905        5,890
6             24         174        2,027       .0858          34,437        2,950

ALL          125       2,301       15,114                     273,074       43,010


The computations are shown in Table 6.10. The neighbouring districts
of St. Albans (5) and Watford (7) (each of which contains rather a small
number of farms) have been combined.
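
The difference between methods (a) and (b) of Section 6.10 can be seen in a short Python sketch using the district figures of Table 6.10. The function names are illustrative, and unrounded ratios are used, so the separate-ratio total differs by a few acres from the tabled 43,010.

```python
# District: (wheat in sample S_i(y), crops and grass in sample S_i(x),
#            crops and grass in district X_i), from Table 6.10.
DISTRICTS = {
    "1": (141, 1935, 22932),
    "2": (259, 1385, 43591),
    "3": (763, 4851, 57263),
    "4": (837, 4034, 73946),
    "5 & 7": (127, 882, 40905),
    "6": (174, 2027, 34437),
}


def combined_ratio_estimate(rows):
    """Method (a): one ratio S(y)/S(x) applied to the county total X."""
    sample_y = sum(y for y, _, _ in rows)
    sample_x = sum(x for _, x, _ in rows)
    county_x = sum(X for _, _, X in rows)
    return sample_y / sample_x * county_x


def separate_ratio_estimate(rows):
    """Method (b): a ratio for each district applied to that district's X_i."""
    return sum(y / x * X for y, x, X in rows)


combined = combined_ratio_estimate(DISTRICTS.values())
separate = separate_ratio_estimate(DISTRICTS.values())
```

With as few as 8 farms in a district the separate ratios are individually rather poorly determined, which is the consideration raised under point (1) of the choice between the two methods.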
6.11 Ratio method: stratified sample with variable sampling fraction

(a) Ratio the same for all strata:

$r = \frac{\Sigma \{g_i S_i(y)\}}{\Sigma \{g_i S_i(x)\}}$

$Y = rX$

$\bar{Y} = r\bar{X}$

(b) Ratio different for different strata:


Proceed in the same manner as for a fixed sampling fraction. The sampling 
fraction does not enter into the calculations. 


6.12 Regression method: random sample 


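The equation of the regression line is


$y = \bar{y} + b(x - \bar{x})$

where

$b = \frac{S(y - \bar{y})(x - \bar{x})}{S(x - \bar{x})^2}$    (6.12.a)

Hence

$\bar{Y} = \bar{y} + b(\bar{X} - \bar{x})$    (6.12.b)

$Y = N\bar{Y}$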


If N is not known exactly, it must be estimated from the sample. 


Note that if b is put equal to S(y)/S(x) formula 6.9 is obtained, and if
put equal to 0 formula 6.4.a is obtained. All values of b will give unbiased
estimates, and consequently any value $b_0$ which appears appropriate to the
data under analysis may be used. Thus, taking $b_0 = 1$ is equivalent to the
use of the differences y − x. The regression method furnishes the value of b
which gives the most accurate estimate of $\bar{Y}$, using a formula of type 6.12.b,
at the cost of some additional computational labour.

Regressions may be used for standardization and in two-phase sampling 
in the same way as ratios. 


Example 6.12.a 


Obtain an estimate of the total area of wheat from the data of Example 6.6, 
using the regression method. 


We have

$\bar{x} = 120.912$    $\bar{y} = 18.408$    $N = 2496$    $\bar{X} = 109.405$

$S(x^2) = 5,061,734$    $S(xy) = 902,958$    $S(y^2) = 207,261$

$\bar{x}S(x) = 1,827,464$    $\bar{x}S(y) = \bar{y}S(x) = 278,219$    $\bar{y}S(y) = 42,357$

$S(x - \bar{x})^2 = 3,234,270$    $S(x - \bar{x})(y - \bar{y}) = 624,739$    $S(y - \bar{y})^2 = 164,904$


The method of calculation of the sums of squares and products is explained
in Section 7.1. The sum of squares of y will be required in the calculation
of the sampling error.


We then have

$b = 624,739/3,234,270 = 0.19316$

$\bar{Y} = 18.408 + 0.19316\,(109.405 - 120.912) = 16.185$

$Y = 2496 \times 16.185 = 40,400$ acres
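
The steps of Example 6.12.a can be sketched in Python starting from the raw sums quoted above. The function name `regression_estimate` and its argument names are illustrative; the correction terms are computed as S(x)²/n and S(x)S(y)/n, which are the same quantities as $\bar{x}S(x)$ and $\bar{x}S(y)$.

```python
def regression_estimate(n, sum_x, sum_y, sum_xx, sum_xy, pop_mean_x, pop_units):
    """Return (b, adjusted mean, estimated total) by the regression method."""
    x_bar = sum_x / n
    y_bar = sum_y / n
    # Sums of squares and products of deviations from the sample means.
    sxx = sum_xx - sum_x * sum_x / n
    sxy = sum_xy - sum_x * sum_y / n
    b = sxy / sxx
    # Adjust the sample mean of y for the discrepancy in the mean of x.
    adjusted_mean = y_bar + b * (pop_mean_x - x_bar)
    return b, adjusted_mean, pop_units * adjusted_mean


b, mean_wheat, total_wheat = regression_estimate(
    n=125,
    sum_x=15114,            # crops and grass in sample
    sum_y=2301,             # wheat in sample
    sum_xx=5061734,
    sum_xy=902958,
    pop_mean_x=273074 / 2496,
    pop_units=2496,
)
```

Since the sample mean of x exceeds the county mean, the adjustment is downward, from 18.4 to about 16.2 acres of wheat per farm.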
Example 6.12.b

Table 6.12 gives the measured volumes of timber on 25 systematically
located plots of 1/10 acre, and eye estimates of the volumes per 1/10 acre in
the stands in which they occurred. If more than one sample plot occurs in
a stand this is indicated by a bracket, but the observations have been treated
as independent in the subsequent computations. The data, which refer to
conifer stands of uniform age and over 20 years of age in two counties, were
obtained in the course of the 1938-9 Census of Woodlands. They are plotted
in Figure 6.8. The total area of conifer stands over 20 years of age in these
two counties was 5124 acres, and the total volume of timber, from eye
estimates of all these stands, was 6,110,000 cu. ft., i.e. 1192 cu. ft. per acre.
Obtain unbiased estimates of the total volume of this class of timber from the
above data.


TABLE 6.12—MEASURED VOLUMES, y, ON 25 SAMPLE PLOTS, AND EYE ESTIMATES,
x, OF CORRESPONDING STANDS (CU. FT. PER 1/10 ACRE)


   y     x        y     x        y     x        y     x

 170   102      195   208      153    79      169   152
  47    14      255   208      216   177      182   152
  64    57      135   110      125    65       74   148
  91    70      146   110      100   196       24   207
 126    95      154   110      287   167      255   167
 146    92      110   110      261   268
  87   110      112   128

Total:   y = 3,684      x = 3,302
Mean:    y = 147.36     x = 132.08


The mean of the measured volumes on the sample plots provides an unbiased
estimate of the volume per acre, and the estimate of the total volume, based
on the measurements of the sample plots only, is consequently

147.36 × 10 × 5124 = 7,551,000 cu. ft.


The ratio method gives

1192 × 5124 × 1473.6/1320.8 = 6,814,000 cu. ft.


Elimination of possible bias in the eye estimates by taking the difference
between the measured volumes and eye estimates on the sample plots
(equivalent to the use of the regression method with an arbitrary coefficient
$b_0 = 1$) gives

5124 (1473.6 − 1320.8 + 1192) = 6,891,000 cu. ft.


The regression method proper requires the calculation of the regression
coefficient. We find

$S(y - \bar{y})^2 = 115,266$    $S(y - \bar{y})(x - \bar{x}) = 52,069$    $S(x - \bar{x})^2 = 82,296$

b = 52,069/82,296 = 0.63270

Y = 5124 {1473.6 + 0.63270 (1192 − 1320.8)} = 7,133,000 cu. ft.

The relative accuracy of these various methods of estimation is discussed
in Example 7.12.b.
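
The four estimates of Example 6.12.b are all instances of one formula of type 6.12.b with different values of b, which the following Python sketch makes explicit. The means are expressed per acre (ten times the plot means); the function name `adjusted_total` and the dictionary labels are illustrative only.

```python
def adjusted_total(y_bar, x_bar, pop_mean_x, b, area):
    """Total estimated as area * {y_bar + b (pop_mean_x - x_bar)}."""
    return area * (y_bar + b * (pop_mean_x - x_bar))


Y_BAR = 1473.6   # mean measured volume on sample plots, cu. ft. per acre
X_BAR = 1320.8   # mean eye estimate for the sampled stands, cu. ft. per acre
POP_X = 1192.0   # mean eye estimate over all stands, cu. ft. per acre
AREA = 5124      # acres

estimates = {
    "plots only (b = 0)": adjusted_total(Y_BAR, X_BAR, POP_X, 0.0, AREA),
    "ratio (b = y_bar / x_bar)": adjusted_total(Y_BAR, X_BAR, POP_X, Y_BAR / X_BAR, AREA),
    "difference (b = 1)": adjusted_total(Y_BAR, X_BAR, POP_X, 1.0, AREA),
    "regression (b = 52,069 / 82,296)": adjusted_total(Y_BAR, X_BAR, POP_X, 52069 / 82296, AREA),
}

for label, value in estimates.items():
    # Rounded to the nearest thousand cubic feet, as in the text.
    print(f"{label}: {round(value, -3):,.0f} cu. ft.")
```

The larger the value of b, the more weight the adjustment gives to the discrepancy between the eye estimates of the sampled stands and of all stands.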

The above data are of course only a small part of the full data for the survey. 
Examination of the whole of the data for the above two counties gave a value 
of b of 0.55. The bias in the eye estimates, which are too low, though not very
large, is apparent in the above data. The average bias over the whole survey 
was decidedly larger, and misleading results would have been obtained by 
using the eye estimates without correction for bias from properly measured 
and randomly located sample plots. 

There is of course the possibility—if the location of the sample plots has 
not been objectively carried out, or if the measurements have been carelessly 
made, e.g. by the inclusion of trees whose centres do not lie within the 
demarcated sample area—that the sample plots will themselves be biased. 
The sample plots used in this survey were somewhat small, and the use of 
larger plots, particularly in the case of hardwoods, possibly with second-stage 
sampling of trees for measurement, would have reduced the risk of bias of this 
nature. The surveyors were well trained, however, and thoroughly appreciated 
the need for objectivity, and on examination it appeared that serious bias from 
this cause could be ruled out. The results of a later survey of England and 
Wales confirmed the correctness of the earlier survey. 


6.13 Regression method: stratified sample with uniform sampling 
fraction 


(a) When the regression coefficient is assumed to be the same for all
strata:


The formulæ for a random sample hold, except that formula 6.12.a is
replaced by

$b = \frac{\Sigma \{S_i(y - \bar{y}_i)(x - \bar{x}_i)\}}{\Sigma \{S_i(x - \bar{x}_i)^2\}}$
(b) When the regression is permitted to assume different values in the 


different strata : 
Proceed as in the ratio method. 


6.14 Regression method : stratified sample with variable sampling 
fraction 


(a) Regression the same for all strata:

$\bar{Y} = \bar{y}_w + b(\bar{X} - \bar{x}_w)$

$b = \frac{\Sigma \{\lambda_i S_i(y - \bar{y}_i)(x - \bar{x}_i)\}}{\Sigma \{\lambda_i S_i(x - \bar{x}_i)^2\}}$

the $\lambda_i$ being numerical weighting coefficients, and

$\bar{y}_w = \frac{\Sigma \{g_i S_i(y)\}}{\Sigma (g_i n_i)}$

with a similar expression for $\bar{x}_w$. $\bar{y}_w$ and $\bar{x}_w$ are the estimates of $\bar{Y}$ and $\bar{X}$ that
would be obtained from the sample if there were no supplementary information
on x (see Section 6.7).

If the regressions within strata are truly linear, with identical values of the
regression coefficient, then the most accurate estimate of b will be obtained
if the $\lambda_i$ are taken inversely proportional to the residual within-strata variances
of y about the regression lines. If the regression coefficients are different for
the different strata, then the component of error due to the assumption of
equality of regression coefficients will be minimized by taking $\lambda_i$ proportional
to $g_i$. Any set of $\lambda_i$ will give a virtually unbiased estimate of $\bar{Y}$, and detailed
investigation of the theoretically best values to adopt is seldom worth while.
For most work $\lambda_i$ may be taken as unity if all the strata contain material of
similar variability, i.e. if the variable sampling fraction arises from extraneous
causes not connected with the variability of the material, and equal to $g_i$ if
the sampling fractions have been chosen so as to minimize the sampling error.
Under certain conditions $\lambda_i = g_i^2$ would be best in this case, but under other
conditions this would give excessive weight to the strata with small sampling
fractions.
(b) Regression different for different strata : 


Proceed as in the ratio method. 


6.15 Use of regression to calibrate eye estimates 


It sometimes happens that eye estimates or similar subjective measurements,
x, can be made on a properly selected and unbiased sample of the population,
but that the objective measurements y, which are required to calibrate these
estimates, can only be carried out on a non-random sub-sample of the original
sample. The eye estimates cannot then be used as supplementary information
in the manner of Example 6.12.b, since any bias in the sub-sample used for
the objective measurements would be reflected to a greater or less extent,
depending on the value of b, in the population estimate derived from the
sample. In this case the regression of x on y, instead of y on x, must be calculated,
and the equation of estimation must be replaced by

$\bar{Y} = \bar{y} + \frac{1}{b'}(\bar{x}_1 - \bar{x})$,

where $b'$ is the regression coefficient of x on y, $\bar{y}$ and $\bar{x}$ are the means for the
sub-sample, and $\bar{x}_1$ is the mean of the eye estimates for the original sample.

This procedure is subject to certain limitations. Firstly, the sub-sample,
though non-random for the whole population, must be effectively
random for units having any given value of y. If, for example, there is a
tendency to select units which, for a given value of y, have high values of x,
serious bias may result. Thus, in a crop-estimation survey, if eye estimates
are made on a random sample of fields, and if reliance is placed on returns
by farmers of the actual yields of some of these fields, any tendency on the
part of the farmers to return only the yields of fields which have turned out 
better than their appearance would indicate will lead to an overestimate of the 
yield. On the other hand, the omission of a greater proportion of the low- 
yielding than of the high-yielding fields from the sub-sample will not bias the 
results, provided this omission is conditioned only by the final yield and not 
by the previous appearance or the value of the eye estimate. 

Secondly, for accuracy in the final estimate, the eye estimates must be 
reasonably accurate in the sense that variation about the regression line must 
be small, and the line itself must have an adequate slope. If the regression 
line is curved, this curvature can only be allowed for in the estimation formula 
if the variation about the regression line is negligible. Otherwise bias will be 
introduced. The use of the best fitting linear regression line, however, will 
avoid this source of bias. 


Example 6.15 


In order to test the accuracy of eye estimation as a method of estimating 
the yields of cereal crops shortly before harvest, a trial survey of the wheat 
crop of Hertfordshire was undertaken in 1940. Two observers were employed, 
one of whom visited 47 farms, observing 110 fields, and the other 16 farms, 
observing 37 fields. The whole set of farms constituted a systematic sample 
of 1 in 12 farms, excluding those growing less than 5 acres of wheat in 1939, 
a random sub-sample of fields being taken on the larger farms. The actual 
yields, as determined by the farmers, were subsequently obtained for as many 
of the observed fields as possible, and these were used to calibrate the eye 
estimates. The relation between the eye estimates, x, and the actual yields 
per acre, y, for the first observer are shown in Fig. 6.15 for the 37 fields for 
which yields were obtained. Obtain an estimate of the mean yield per acre 
for the part of the county covered by this observer. 


The regression coefficient, b′, of x on y, calculated from the unweighted
values of x and y for the 37 fields, is 0.6926, the regression equation being

x = 30.00 + 0.6926 (y − 28.78)

This is shown by the full line in the figure. The dotted line represents the 
line that would be obtained if there were no errors in the eye estimates. It 
will be seen that there is a tendency to underestimate high yields and over- 
estimate low yields. The other observer and the farmers gave very similar 
results. 

The mean of the eye estimates $\bar{x}_1$ for the whole of the first observer’s
sample, and that for the eye estimates $\bar{x}$ and the yields $\bar{y}$ of the fields for which
yields were available, weighted according to the acreages of the fields, with
an additional raising factor if all the fields on the farm are not sampled, are,
in bushels per acre,

$\bar{x}_1 = 30.12$    $\bar{x} = 30.13$    $\bar{y} = 28.95$

Hence, since 1/0.6926 = 1.444, the final estimate of the yield per acre is

$\bar{Y} = 28.95 + 1.444\,(30.12 - 30.13) = 28.94$


The adjustment is here negligible, since $\bar{x}_1$ and $\bar{x}$ are almost identical.
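
The calibration step of Example 6.15 can be sketched in Python as below. The adjustment divides by the regression coefficient of x on y rather than multiplying by that of y on x; the name `calibrated_mean` and its argument names are illustrative only.

```python
def calibrated_mean(y_bar_sub, x_bar_sub, x_bar_full, b_x_on_y):
    """Mean of y adjusted with the regression of x on y.

    y_bar_sub, x_bar_sub: means for the sub-sample with objective measurements
    x_bar_full:           mean eye estimate for the whole original sample
    b_x_on_y:             regression coefficient of x on y (b' in the text)
    """
    return y_bar_sub + (x_bar_full - x_bar_sub) / b_x_on_y


# Figures of Example 6.15, in bushels per acre.
yield_per_acre = calibrated_mean(
    y_bar_sub=28.95, x_bar_sub=30.13, x_bar_full=30.12, b_x_on_y=0.6926
)
```

A flat regression line (small b′) would magnify any difference between the two means of x, which is why the text requires the line to have an adequate slope.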
6.16 Sampling with probabilities proportional to size of unit


(a) Size, x, of all units of the population known, or X known:


Here x is a supplementary variate, and the ratio method will be appropriate.
Since the probability of selection is proportional to x, weights
proportional to 1/x must be introduced into the formulæ
already given. This leads to the formulæ

$r = \frac{1}{n} S\left(\frac{y}{x}\right)$

$Y = rX$

In other words the unbiased estimate of the population value of the ratio is
given by the arithmetic mean of the ratios from the selected sampling units.
given by the arithmetic mean of the ratios from the selected sampling units. 


(b) Total size X of the population not known:

In this case X, as well as Y and r, have to be estimated from the sample.
Selection has to be made by some such process as randomly or systematically
locating points on a map, and points not falling in the units under consideration
must be taken into account. If $n_0$ is the total number of sampling points,
and A is the total area covered by the sampling grid, we have

$r = \frac{1}{n} S\left(\frac{y}{x}\right)$

$X = A\,n/n_0$

$Y = rX = rA\,n/n_0$


Alternatively, if A is not known exactly the density d of points per unit area
may be used. We have

$A = n_0/d$

$X = n/d$


If the sampling is two-phase, with $n_0'$ points (density $d'$) at the first phase,
of which $n'$ fall in the units under consideration, n, $n_0$ and d must be replaced
by $n'$, $n_0'$ and $d'$ in the above formulæ for A, X and Y.


Example 6.16.a


In a survey to estimate the area and yield of a crop, systematically located
points at a density of one per 4 square miles are taken, and the yields per acre
of the fields in which the points fall and which carry the crop are determined
by the harvesting of small areas. 8317 points in all are obtained, of which
529 fall in fields carrying the crop. The arithmetic mean of the yields per acre
of the selected fields is 15.7 cwt. per acre. Estimate the total area and yield
of the crop.


A density of 1 per 4 square miles is equal to 1/2560 per acre. Hence

area = X = 529 × 2560 = 1,354,000 acres
yield = Y = 15.7 × 1,354,000 cwt. = 1,063,000 tons


Example 6.16.b


If, in addition to the yield data of Example 6.16.a, a further 24,938 points
were surveyed for type of crop only, giving an overall density, with the
8317 points of the yield survey, of one point per square mile, and 1673 of the
fields so located were found to carry the crop in question (in addition to the
529 fields above), obtain revised estimates of the total area and yield of the crop.


This is an example of two-phase sampling. The two sets of points together
constitute the first phase, for area of crop, and the 529 points for which yield
samples were taken constitute the second phase.

We therefore have

$n_0'$ = 24,938 + 8317 = 33,255

and

$n'$ = 1673 + 529 = 2202

The density at the first phase is 1/640 per acre. Consequently

area = X = 2202 × 640 = 1,409,000 acres
yield = Y = 15.7 × 1,409,000 cwt. = 1,106,000 tons
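
The arithmetic of Examples 6.16.a and 6.16.b can be checked with a short script. The following Python sketch is illustrative only: the function name `point_sample_estimate` and the two constants are assumptions introduced here (640 acres to the square mile, 20 cwt. to the ton), and the totals are carried without the intermediate rounding used in the text.

```python
ACRES_PER_SQUARE_MILE = 640   # assumed conversion
CWT_PER_TON = 20              # assumed conversion


def point_sample_estimate(points_in_crop, acres_per_point, mean_yield_cwt):
    """Return (area in acres, total yield in tons) from a point sample.

    points_in_crop  : number of points falling in fields carrying the crop (n)
    acres_per_point : reciprocal of the density of points per acre (1/d)
    mean_yield_cwt  : arithmetic mean of the yields per acre of the selected fields
    """
    area = points_in_crop * acres_per_point               # X = n / d
    total_tons = mean_yield_cwt * area / CWT_PER_TON      # Y = mean ratio times X
    return area, total_tons


# Example 6.16.a: one point per 4 square miles, 529 points in the crop.
area_a, tons_a = point_sample_estimate(529, 4 * ACRES_PER_SQUARE_MILE, 15.7)

# Example 6.16.b: first phase at one point per square mile, 1673 + 529 points in the crop.
area_b, tons_b = point_sample_estimate(1673 + 529, ACRES_PER_SQUARE_MILE, 15.7)

print(f"6.16.a: {area_a:,} acres, {tons_a:,.0f} tons")
print(f"6.16.b: {area_b:,} acres, {tons_b:,.0f} tons")
```

Rounded to the nearest thousand, these totals agree with the figures given in the two examples.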


6.17 Sampling from within strata with probabilities proportional to 
size of unit 

In this case the sizes of all units will be known. As pointed out in
Section 3.10, if more than one unit is selected from some or all of the strata,
the same unit not being selected twice, the probabilities will not in fact be
exactly proportional to size, and slight bias will be introduced. The ratios
from the selected units are meaned separately for each stratum, giving
equations of estimation

$$\bar{r}_i = \frac{1}{n_i}\,S\!\left(\frac{y}{x}\right) \qquad Y = S(\bar{r}_i X_i)$$

where $n_i$ is the number of units selected from stratum i and $X_i$ is the total
size of that stratum.


Example 6.17


Estimate the acreage of wheat and the number of farms growing wheat
in Hertfordshire from the sample of parishes described in Section 3.11.


TABLE 6.17.a—HERTFORDSHIRE WHEAT: SAMPLE OF 17 "COMBINED"
PARISHES SELECTED FROM WITHIN DISTRICTS WITH PROBABILITY PROPORTIONAL
TO SIZE

(Entries shown as ... are not recoverable.)

District    Wheat    Crops and grass    Ratio
   1         ...         3,350           .049
   2         ...         3,040           .252
             ...         3,440           .204
             ...         2,040           .247
   3         311         2,370           .131
             ...         3,330           .068
             ...         2,290           .109
             686         2,930           .234
   4         ...         2,300           ...
             775          ...            ...
             495          ...            ...
             565          ...            .283
             862         4,160           .207
   5         818         3,470           .236
   6         225         2,520           .089
             738         3,740           .197
   7         290         3,060           .095


SECT. 6.18 SAMPLING METHODS FOR CENSUSES AND SURVEYS


TABLE 6.17.b—ESTIMATION OF WHEAT ACREAGE FROM THE DATA
OF TABLE 6.17.a

              Mean ratio      Acreage of         Estimated
District      wheat/crops     crops and grass    acreage
              and grass       in district        of wheat
   1            .049             22,932            1,120
   2            .234             43,591           10,200
   3            .136             57,263            7,790
   4            .206             73,946           15,230
   5            .236             24,964            5,890
   6            .143             34,437            4,920
   7            .095             15,941            1,510

Total                           273,074           46,660

The data from the sampled parishes are shown in Table 6.17.a, and the 
further computations for wheat acreage in Table 6.17.b. 

The computations for number of farms growing wheat follow exactly the 
same lines, using the ratio of number of farms growing wheat to acreage of 
crops and grass for each parish. These computations are left as an exercise 
to the reader. 

The use of the ratio of number of farms growing wheat to the total number 
of farms in the district would be equally admissible if the selection of parishes 
had been made with constant probability, but when the probability of selection 
is taken proportional to size of unit, the ratio to size, however defined, must 
be taken for all variates. 
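
The computation of Table 6.17.b amounts to multiplying each district's mean ratio by the district acreage of crops and grass and summing. The Python sketch below reproduces it; the dictionary layout and variable names are assumptions made for illustration, and each district estimate is rounded to the nearest 10 acres as in the table.

```python
# District: (mean ratio wheat/crops and grass, acreage of crops and grass in district),
# taken from Table 6.17.b.
districts = {
    1: (0.049, 22932),
    2: (0.234, 43591),
    3: (0.136, 57263),
    4: (0.206, 73946),
    5: (0.236, 24964),
    6: (0.143, 34437),
    7: (0.095, 15941),
}

# Estimated wheat acreage per district, rounded to the nearest 10 acres.
estimated_wheat = {
    district: 10 * round(mean_ratio * acreage / 10)
    for district, (mean_ratio, acreage) in districts.items()
}

total_crops_and_grass = sum(acreage for _, acreage in districts.values())
total_wheat = sum(estimated_wheat.values())

for district, acres in estimated_wheat.items():
    print(f"District {district}: {acres:,} acres of wheat")
print(f"Total: {total_wheat:,} acres of wheat on {total_crops_and_grass:,} acres")
```

The same loop serves for the number of farms growing wheat once the corresponding mean ratios per acre of crops and grass have been computed.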


6.18 Multi-stage sampling, no supplementary information 


In multi-stage sampling the process of estimation can be carried out stage 
by stage, using the appropriate methods of estimation at each stage. It is often 
more convenient, however, to combine all stages in a single process of 
estimation. 

Thus the combined or overall raising factor g for any sub-unit in two-stage
sampling is given by the product of the first-stage raising factor $g'$ of the
main unit in which it occurs and the second-stage raising factor $g''$ of the
particular sub-unit, i.e.

$$g = g'g''$$

Hence the general formula for Y, when there is no supplementary information,
is

$$Y = S(gy)$$

where the summation is taken over all units, with similar formulæ for N, etc.

If the combined raising factors for a group of units are equal then the
computations will be simplified by summing these units before multiplication.
In particular, if all the combined raising factors are equal, the sample can be
treated for purposes of estimation as if it were an ordinary random or stratified
random sample with uniform sampling fraction.


6.19 Multi-stage sampling with supplementary information 


(a) Ratio method, ratio the same for whole population :

$$r = \frac{S(gy)}{S(gx)} \qquad Y = rX$$

etc., where $g = g'g''$.

(b) Ratio method, ratio different for different parts of the population : 

Many variants are possible. All can be resolved by proceeding stage 
by stage by the methods already outlined. The danger of introducing 
bias if the number of units on which the ratios are based is small must be 
recognized. 

(c) Regression method :

Regressions will usually be employed at the first stage of the sampling, in
which case the regression coefficient or coefficients will be calculated in the
manner appropriate to the type of sampling adopted at this stage, using the
values of the totals of x and y for each main unit estimated from the second-
stage sampling.

If regression is used at the second stage the procedure for stratified samples
can be used, treating the selected first-stage units as if they were strata.

(d) Sampling with probability proportional to size :

An important case is that in which the first-stage units are sampled from
within strata with probability proportional to size, and the second-stage
sampling fractions are chosen so as to give a uniform overall sampling fraction.
In this case the use of the mean ratios $\bar{r}_i'$ at the first stage of the estimation
process (Section 6.17) will be found to be equivalent to the direct estimation
from the second-stage units by means of the overall raising factors, i.e. Y = S(gy).*

* The estimation of error of this type of sampling is described in Section 10.9.


Example 6.19

From the data of Table 6.19.a, obtained in the course of the Survey of
Fertilizer Practice, estimate the average dressing of nitrogenous fertilizer on
sugar beet in Norfolk.


The two-stage sampling procedure of this survey has been described in
Section 4.23. Information is lacking from a few farms, mainly owing to
changes in tenancy. Since this affects the small farms to a greater extent
than the large farms the adjusted sampling fractions shown in the table have
been used. These are equal to the number of farms on which information
is available divided by the total number of farms in the size-group. The
second-stage sampling fractions are given by the reciprocals of the number
of fields, since one field is selected on each farm. The combined raising factor
for the sampled field of the first farm in Table 6.19.a is therefore 105, that for
the sampled field of the third farm is 105 × 3 = 315, etc.


TABLE 6.19.a—SURVEY OF FERTILIZER PRACTICE: DATA ON THE APPLICATION
OF NITROGENOUS MANURES TO SUGAR BEET ON OLD ARABLE LAND IN NORFOLK


No: Acreage Gwe. No. Acreage Cat No. Acreage Oe 
o —— 7 of ~a o: m 
Ids er per er 
fie Total Sample P es fields Total Sample ae fields Total Sample He 
Small farms Medium farms Medium farms (contd.) 
1 2 2.) | -68 2 13 10 -42 1 2 412 16 
1 6 6 4 40 5 -63 1 6 6 -30 
3 an 2 1 $8 eas Du 7 42 
2 4 3 : 14 4 42 3 21 6 14 
1 5 5 al an ae 1 8 8 49 
1 3 3 2 16 10 “90 2 10 2 +49 
1 4 4 3 19 7 21 1 4 a ai 
1 2 2 E A aM mes 1 4 4 30 
2 6 2 3 30 4 83 F ld 7 46 
2 8 4 2 19 13 -52 6) 49 O ese 
1 2 2 1 9 9 “52 1 6 6 21 
2 4 3 1 20 20 42 1 4 4 “30 
1 6 6 5 26 7 30 2 19 8 2 
2 5 4 2 8 6 -54 Large farms 
2 oe 4 4 20 8 21 1 8 8° “ho 
1 2 2 1 4 4-42 1. 48 487 Mo 
1 4 4 1 20 20 “42 3 19 4 0 
1 5 5 2 i 4 36 3 56 24 68 
2 3 2 4 32 11 “63 2 22 5 “36 
1 7 7 2 16 8 +82 4 30 5 +42 
ek ead 2H) 26.) B33. haba 3 20 5 ema 
eee tod 2 2 10 56 6 126 29 +75 
i ga E Cheer EE EG Gp 


Number of farms without sugar beet on old arable land: small (6-50 acres), 8; 
medium (51-300 acres), 11; large (301- acres), 5. 


Sampling fractions (adjusted for absence of information) : small, 1/105; medium, 
1/59; large, 1/30. 


The average dressing of nitrogen must be obtained by calculating the
raised total of the amount of nitrogen applied S(gy) and the raised total of
the acreage sampled S(gx). The amount of nitrogen applied to a field is given
by the product of the acreage and the rate per acre. The three size-groups
are best kept separate in the computation. For the first size-group, therefore,
applying the second-stage raising factors, we have

$$S(g''y) = 1 \times 2 \times 0.68 + 1 \times 6 \times 0.63 + 3 \times 2 \times 0.55 + \ldots = 1.36 + 3.78 + 3.30 + \ldots$$
$$S(g''x) = 1 \times 2 + 1 \times 6 + 3 \times 2 + \ldots = 2 + 6 + 6 + \ldots$$

This gives the results shown in Table 6.19.b. Applying the first-stage raising
factors to the total nitrogen and total acreage, we obtain the average dressing
of nitrogen per acre:

$$\frac{46.13 \times 105 + \ldots}{104 \times 105 + \ldots} = \frac{28{,}512.04}{58{,}229} = 0.490 \text{ cwt.}$$


TABLE 6.19.b—ESTIMATION OF AVERAGE DRESSING FROM RAISED RESULTS

               Total         Total        Nitrogen             First-stage
Size-group     nitrogen      acreage      per acre             raising factor
               S(g''y)       S(g''x)      S(g''y)/S(g''x)      g'
Small            46.13         104          .444                 105
Medium          285.41         601          .475                  59
Large           227.64         395          .576                  30

The data presented comprise only a small part of the information collected,
and the above method of estimation therefore demands a good deal of
computation. For certain purposes comparative figures may be obtained
from the straight averages of the dressings per acre on the sampled fields.
These averages are given in Table 6.19.c. The first-stage raising factors
have not been used in this calculation, since the larger farms have more and
larger fields in sugar beet, so that inequality in the omitted second-stage raising
factors more than compensates for the difference in the first-stage raising
factors.


TABLE 6.19.c—ESTIMATION OF AVERAGE DRESSING
FROM UNWEIGHTED MEANS

Size-group     No. of farms      Sum       Mean
Small               22           9.30      .423
Medium              36          16.32      .453
Large                9           3.74      .416
All                 67          29.36      .438


It will be noted that the mean dressings are less than those previously 
obtained for all size-groups, indicating the possibility that farms with little 
sugar beet, which are overweighted in the straight averages, are using less 
nitrogen per acre than those with a large amount of sugar beet. The data are 
too variable, however, to determine with certainty from this sample alone 
whether this is really a bias or is due to random sampling errors. The large 
difference for the large farms, for example, is due to farm 8 having a very 
large acreage of sugar beet. The relative accuracy of the two methods of 
estimation, apart from bias, is discussed in Example 7.17. 

In the Survey of Fertilizer Practice the second method of estimation was 
used in investigation of secondary points, e.g. comparison of different types 
of farms. For the more important estimates, such as mean dressings per acre, 
a modification of the first method was used, the total acreages of sugar beet
on the farms being taken as the raising factors for the second stage. This 
method of estimation is slightly more accurate than the first method given 
above, but will be biased if there is any tendency for farmers to apply heavier 
(or lighter) dressings to their large fields. There is no evidence that any 
appreciable bias does in fact arise from this cause, but even so it is perhaps 
doubtful whether there is much advantage in using this method of estimation 
rather than the unbiased method given above. The method would have been 
unbiased had selection of fields within a farm been made proportional to area, 
but this would have demanded somewhat more elaborate methods of selection 
in the field. 


6.20 Systematic and balanced samples 


The methods of estimation described in the preceding sections are also 
appropriate for systematic and balanced samples. Samples of these types 
without other restrictions, for instance, can be treated as if they were random 
samples for the estimation of the population values (but not for the sampling
errors) ; if there is stratification the procedure for stratified random samples 
holds. An example of a systematic stratified sample with variable sampling 
fraction has already been given (Example 6.7). 

Certain estimation processes are naturally inappropriate to systematic and 
balanced samples. If the process of selection in a systematic sample is such, 
for example, that stratification is automatically introduced, there will be no 
gain from stratification after selection. Equally in a balanced sample the 
variate for which balance has been effected will be of no further value as 
supplementary information—the balance ensures that the corrections based 
on regression, or ratio, will be zero, whatever the value of the regression 
coefficient. If each stratum is balanced separately, then the corrections for 
the different strata will all be zero, even if the regression coefficient or ratio 
varies from stratum to stratum. 

In systematic samples of material which varies in a continuous manner, 
some gain in accuracy may result from the use of what are known as 
end-corrections. These corrections are made by assigning to the boundary
observations weights which depend on their distance from the boundary. In
systematic one-dimensional sampling of a line AB (Fig. 6.20), for example, with
sampling points located at $P_1, P_2, \ldots, P_6$, if $Q_1, Q_2, \ldots, Q_5$ are the mid-
points of $P_1P_2$, etc., we may regard the observations at $P_2, P_3, P_4, P_5$ as
estimates covering the lengths $Q_1Q_2$, $Q_2Q_3$, etc. The observations at $P_1$
and $P_6$ can similarly be regarded as estimates covering the lengths $AQ_1$ and
$Q_5B$. Consequently the weights assigned to $P_1$ and $P_6$ relative to that assigned
to $P_2$, $P_3$, etc., will be $AQ_1/Q_1Q_2$ and $Q_5B/Q_1Q_2$.

FIG. 6.20—SYSTEMATIC SAMPLE, $P_1, P_2, \ldots, P_6$, OF THE LINE AB
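
A small numerical illustration of end-corrections may help. The Python sketch below assumes a hypothetical line AB of length 60 with six equally spaced sampling points starting at 3; the function name, the positions, and the observed values are all invented for the illustration and are not taken from any survey.

```python
def end_correction_weights(a, b, points):
    """Relative weights for equally spaced sampling points on the line from a to b.

    Interior points receive weight 1; each end point receives the length it
    covers (boundary to nearest mid-point) divided by the sampling interval.
    """
    interval = points[1] - points[0]                 # Q1Q2, covered by an interior point
    first_mid = (points[0] + points[1]) / 2          # Q1
    last_mid = (points[-2] + points[-1]) / 2         # last mid-point (Q5 for six points)
    weights = [1.0] * len(points)
    weights[0] = (first_mid - a) / interval          # AQ1 / Q1Q2
    weights[-1] = (b - last_mid) / interval          # Q5B / Q1Q2
    return weights


# Hypothetical example: line from 0 to 60, points at 3, 13, ..., 53.
positions = [3, 13, 23, 33, 43, 53]
observations = [4.0, 5.0, 6.0, 5.0, 7.0, 9.0]       # invented values

weights = end_correction_weights(0, 60, positions)
weighted_mean = sum(w * y for w, y in zip(weights, observations)) / sum(weights)
unweighted_mean = sum(observations) / len(observations)

print(weights)
print(round(weighted_mean, 3), round(unweighted_mean, 3))
```

Here the first point covers only 8 units of length and the last covers 12, so the last observation receives the greater weight.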

The same principle, or an adaptation of it, might be applied in the case 
of two-dimensional systematic sampling of areas. End-corrections, however, 
are not likely to be of much value in the type of material usually dealt with 
in census and survey work, and we shall not discuss them further here, beyond 
mentioning that if the regions near the boundary differ from the remainder 
of the area, the use of end-corrections, instead of separate treatment of the 
boundary regions in the manner outlined in Section 3.14, will lead to biased 


estimates. 


6.21 Sampling on two successive occasions 


The most straightforward procedure for estimating the values of the
population mean on two successive occasions is to treat each occasion
separately, following whatever method of estimation is appropriate to the
sample obtained on that occasion, regardless of the values obtained on the
other occasion. Such estimates may be termed overall estimates.

With independent samples on the two occasions, or with the same fixed
sample on each occasion, the overall estimates will contain virtually all the
available information, but where a sub-sample is taken on the second occasion,
or there is partial replacement of the sample on the second occasion, the
situation is more complicated.

If the sample on the second occasion is confined to a sub-sample of the
original sample, change will be most simply estimated from the differences
of the units included in the sub-sample only. An estimate of the population
mean or total on the second occasion is similarly obtained by adding the
estimated change to the overall estimate on the first occasion. The most
accurate estimate of the population mean on the second occasion will, however,
be obtained by calculating the regression of the sample values for the second
occasion on the corresponding values for the first occasion, and using the
values for the first occasion as supplementary information. The
procedure is exactly the same as has been already outlined for use of
supplementary information by the regression method, the method appropriate
to the type of sampling used being followed. The most accurate estimate of
the change can then be obtained by taking the difference of this estimate of
the mean from the overall estimate for the first occasion.

The formulæ for this procedure are as follows. Denote the sample values
on the first occasion by x and those on the second occasion by y, the values
belonging to units included in both samples by $x'$, $y'$, and those included on
the first occasion only by $x''$. If a fraction λ of all the units included on the
first occasion are taken on the second occasion, and a fraction μ equal to 1 - λ
are omitted, then, for a random sample,

$$\bar{x} = \lambda\bar{x}' + \mu\bar{x}''$$
$$\bar{y} = \bar{y}' + b\,(\bar{x} - \bar{x}')$$

where $\bar{x}$ is the overall estimate for the first occasion and $\bar{y}$ the adjusted estimate
for the second occasion. The change is consequently estimated by

$$\bar{y} - \bar{x} = \bar{y}' - \bar{x}' + \mu\,(1 - b)(\bar{x}' - \bar{x}'')$$


The calculation of the regression coefficient is based on the values of the units 
which are included on both occasions. 

If the changes of individual units are small compared with the differences
between units, i.e. if the correlation between units on the two occasions is very
close, as is likely to be the case when this type of sampling is adopted, b will
be nearly equal to unity, and the estimate of change will differ little from that,
$\bar{y}' - \bar{x}'$, derived from the units included on both occasions only. Equally
the estimate of the population mean or total will differ little from that obtained
by adding the estimate of change to the overall estimate on the first occasion.

When a sample of the same size is taken on each occasion with partial
replacement, a fraction μ of the units being replaced and a fraction λ being
retained, the sample units which are retained can be used to furnish an
estimate $\bar{y}_1$ of the population mean on the second occasion by the regression
method already given. In addition there will be a further independent
estimate $\bar{y}_2$, equal to $\bar{y}''$, derivable from the sampling units which are included
on the second occasion only. The most accurate estimate $\bar{y}_w$ will be provided
by a weighted mean of these two estimates. The correct weights are
$\lambda/(1 - \mu^2r^2)$ and $\mu(1 - \mu r^2)/(1 - \mu^2r^2)$, where r is the correlation coefficient
between the unit values on the first and second occasions.

The correlation coefficient is calculated in the same manner as the regression
coefficient b, using the values of the units common to both occasions, with
the exception that instead of dividing by a quantity of the type $S(x - \bar{x})^2$
we divide by the corresponding quantity of the type $\sqrt{\{S(x - \bar{x})^2\,S(y - \bar{y})^2\}}$.
Thus, for a pair of random samples,

$$r = \frac{S(x' - \bar{x}')(y' - \bar{y}')}{\sqrt{\{S(x' - \bar{x}')^2\,S(y' - \bar{y}')^2\}}}$$
In the more complicated types of sampling the sums of squares and products
are modified in the same manner as in the calculation of b.


We thus have

$$\bar{y}_w = \frac{\lambda}{1 - \mu^2r^2}\,\{\bar{y}' + b\,(\bar{x} - \bar{x}')\} + \frac{\mu\,(1 - \mu r^2)}{1 - \mu^2r^2}\,\bar{y}''$$

If the numbers in the sample on the two occasions are not the same the
above formula takes the modified form

$$\bar{y}_w = \frac{n'\{\bar{y}' + b\,(\bar{x} - \bar{x}')\} + n''(1 - \mu r^2)\,\bar{y}''}{n' + n''(1 - \mu r^2)} \qquad (6.21.a)$$

where $n'$ is the number of units re-sampled on the second occasion, $n''$ is the
number of new units, and μ is the proportion of units sampled on the first
occasion which are not re-sampled on the second occasion.

An estimate of the change can similarly be obtained by taking the weighted
mean of the two estimates, $\bar{y}' - \bar{x}'$ and $\bar{y}'' - \bar{x}''$. The weights to be assigned
to these estimates are $\lambda/(1 - \mu r)$ and $\mu(1 - r)/(1 - \mu r)$, so that

$$\text{Change} = \frac{\lambda\,(\bar{y}' - \bar{x}') + \mu\,(1 - r)(\bar{y}'' - \bar{x}'')}{1 - \mu r} \qquad (6.21.b)$$

This estimate of the change will differ from that given by the difference
of $\bar{y}_w$ and the overall estimate on the first occasion. The reason for this is
that once the sample on the second occasion has been taken, a more accurate
estimate of the population mean on the first occasion is possible by using the
information provided by the sample on the second occasion as supplementary
information. If this revised estimate $\bar{x}_w$ is calculated, then the estimate of
change given above will be very nearly equal to $\bar{y}_w - \bar{x}_w$. The slight discrepancy
arises from the fact that unless the variances on the two occasions are equal
the estimate of change given above is not quite the most accurate possible.
It will be noted that when r is equal to 1, the above estimate of change is
equal to $\bar{y}' - \bar{x}'$, whereas if r equals 0 the estimate is equal to the difference
of the overall estimates of the population means. Similarly, when r equals 0,
$\bar{y}_w$ equals the overall mean of y, and if the values of each unit are the same on
both occasions ($\bar{x}' = \bar{y}'$, r = 1) $\bar{y}_w$ is the mean of all the
values included on either occasion, each value being counted once only.

A further practical point arises in connection with the estimation of b. If
the variability on the two occasions is the same, both the regression coefficients,
y on x and x on y, will be equal to the correlation coefficient. In most material
which is likely to be sampled on successive occasions of the type under
consideration the variability on the different occasions may be expected
to be about the same, and for such material it is best to replace b by r, as the
latter is less subject to errors of estimation.
Example 6.21


The percentage solids-not-fat in two successive months for all the 16 cows
in a herd which were in their 2-6 months of lactation in one, at least, of these
months were observed to be:


Cow . . .       1      2      3      4      5      6      7      8
November .    8.82   8.94   9.86   8.90   9.00   9.13   8.90   9.02
December .     —      —      —      —     8.98   8.66   8.68   8.86
Cow . . .       9     10     11     12     13     14     15     16
November .    9.46   9.52   9.28   9.22    —      —      —      —
December .    9.30   9.50   9.13   9.32   9.38   8.78   9.10   9.04


Estimate the change in percentage solids-not-fat between the two months,
the mean percentage for December and the revised mean percentage for
November.


The sampling is here due to natural causes, and the number of cows in
their 2-6 months of lactation will therefore not remain completely constant.
The above two months were selected from more extensive records.


We find:

$$S(x') = 73.53 \qquad S(y') = 72.43$$
$$S(x'') = 36.52 \qquad S(y'') = 36.30$$
$$\bar{x}' = 9.1912 \qquad \bar{y}' = 9.0538 \qquad \bar{y}' - \bar{x}' = -0.1374$$
$$\bar{x}'' = 9.130 \qquad \bar{y}'' = 9.075 \qquad \bar{y}'' - \bar{x}'' = -0.055$$
$$\bar{x} = 9.1708 \qquad \bar{y} = 9.0608 \qquad \bar{y} - \bar{x} = -0.11$$
$$S(x' - \bar{x}')^2 = 0.3435 \qquad S(x' - \bar{x}')(y' - \bar{y}') = 0.4076 \qquad S(y' - \bar{y}')^2 = 0.6742$$

$$r = 0.4076/\sqrt{0.3435 \times 0.6742} = 0.847$$


It will be noted that the value of b (0.4076/0.3435 = 1.187) is greater than
unity, which illustrates
the point made above that r provides a better estimate of the regression in
material of this kind. The estimation of r should normally be based on more
extensive data, though no very high accuracy is required; 100 pairs of values
will be fully adequate. In the present case the more extensive data confirm
that the above value of r is about correct.


We also have $\lambda = 2/3$, $\mu = 1/3$, and thus

$$\frac{\lambda}{1 - \mu^2r^2} = 0.724 \qquad \frac{\mu\,(1 - \mu r^2)}{1 - \mu^2r^2} = 0.276$$
$$\frac{\lambda}{1 - \mu r} = 0.929 \qquad \frac{\mu\,(1 - r)}{1 - \mu r} = 0.071$$

$$\bar{y}_w = 0.724\,\{9.0538 + 0.847\,(9.1708 - 9.1912)\} + 0.276 \times 9.075 = 9.0471$$
$$\text{Change} = 0.929\,(-0.1374) + 0.071\,(-0.055) = -0.1315$$

The estimate of the mean for November, revised on the basis of the
December values, is

$$\bar{x}_w = 0.724\,\{9.1912 + 0.847\,(9.0608 - 9.0538)\} + 0.276 \times 9.13 = 9.1786$$

giving the check, 9.0471 - 9.1786 = -0.1315. Agreement is here exact,
since the two regression coefficients, y on x and x on y, have both been taken
equal to r.
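
The whole of Example 6.21 can be reproduced from the sixteen pairs of values. The Python sketch below is illustrative: the variable names are assumptions, b is replaced by r as recommended above, and the December value for cow 11 is taken as 9.13, which is the value consistent with the printed sum S(y') = 72.43. Full precision is carried throughout, so the last figure may differ slightly from the hand computation.

```python
from math import sqrt

# Percentage solids-not-fat; None marks a month in which the cow was not included.
# ASSUMPTION: December, cow 11 is taken as 9.13 (consistent with S(y') = 72.43).
november = [8.82, 8.94, 9.86, 8.90, 9.00, 9.13, 8.90, 9.02,
            9.46, 9.52, 9.28, 9.22, None, None, None, None]
december = [None, None, None, None, 8.98, 8.66, 8.68, 8.86,
            9.30, 9.50, 9.13, 9.32, 9.38, 8.78, 9.10, 9.04]


def mean(values):
    return sum(values) / len(values)


rows = list(zip(november, december))
x_common = [x for x, y in rows if x is not None and y is not None]   # x'
y_common = [y for x, y in rows if x is not None and y is not None]   # y'
x_only = [x for x, y in rows if x is not None and y is None]         # x''
y_only = [y for x, y in rows if x is None and y is not None]         # y''

x1, y1 = mean(x_common), mean(y_common)
x2, y2 = mean(x_only), mean(y_only)
x_all = mean(x_common + x_only)                                      # overall mean, first occasion

sxx = sum((x - x1) ** 2 for x in x_common)
syy = sum((y - y1) ** 2 for y in y_common)
sxy = sum((x - x1) * (y - y1) for x, y in zip(x_common, y_common))
r = sxy / sqrt(sxx * syy)

lam = len(x_common) / (len(x_common) + len(x_only))                  # fraction retained
mu = 1 - lam                                                         # fraction replaced

# Weighted estimate of the December mean.
w1 = lam / (1 - mu ** 2 * r ** 2)
w2 = mu * (1 - mu * r ** 2) / (1 - mu ** 2 * r ** 2)
y_w = w1 * (y1 + r * (x_all - x1)) + w2 * y2

# Weighted estimate of the change (formula 6.21.b).
change = (lam * (y1 - x1) + mu * (1 - r) * (y2 - x2)) / (1 - mu * r)

print(f"r = {r:.3f}, y_w = {y_w:.4f}, change = {change:.4f}")
```

The two weights `w1` and `w2` sum to unity, which provides a convenient check on the arithmetic.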


6.22 Sampling on a number of successive occasions 


The formulæ of estimation given in the last section cover all cases of
sampling on two occasions only. When sampling is carried out with partial
replacement on more than two occasions no such simple general solution is
possible, but certain approximate solutions, which are very similar in form to
those for sampling on two occasions, are likely to be sufficient for most practical
purposes.

In a sampling scheme which is repeated at intervals it is generally desirable
to provide as accurate an estimate as possible of the population mean on each
occasion without any revision of the estimates for previous occasions. Suppose
that $\bar{Y}_h$ is the most accurate estimate which can be obtained for occasion h,
taking into account the results of the sampling up to and including this
occasion h, and that $\bar{Y}_{h-1}$ is the similar estimate for occasion $h - 1$, taking
into account the results up to and including occasion $h - 1$. Subject to certain
limitations, $\bar{Y}_h$ and $\bar{Y}_{h-1}$ are related by a formula of the type

$$\bar{Y}_h = (1 - \phi)\,\{\bar{y}_h' + r\,(\bar{Y}_{h-1} - [\bar{y}_{h-1}'])\} + \phi\,\bar{y}_h'' \qquad (6.22.a)$$

where suffices indicate the occasion, single dashes units common to occasions
h and $h - 1$, the mean on the earlier occasion being distinguished by square
brackets, and double dashes units occurring on occasion h only.

The limitations are that a given fraction of the units is replaced on each
occasion, that the variability on the different occasions and the correlation
r between successive occasions are constant, and that the correlation between
occasions two apart is $r^2$, that between occasions three apart is $r^3$, etc. This last
condition is only necessary when units are included for more than two occasions,
and no great loss of accuracy will occur under normal circumstances if it does
not hold exactly.

The value of φ depends on the value of r, on the fraction μ replaced on
each occasion, and on the number of occasions h on which samples have already
been taken. With increasing h, φ rapidly tends to a limiting value, which
depends only on r and μ. This limiting value is

$$\phi = \frac{-(1 - r^2) + \sqrt{(1 - r^2)\{1 - r^2(1 - 4\lambda\mu)\}}}{2\lambda r^2}$$

The values for h = 2 have been given in the previous section. For practical
purposes the limiting value of φ may be used for all occasions after the second.
(The above formula for φ is due to Mr. H. D. Patterson.)

When the value of r has been determined, the values of φ can be calculated
and formula 6.22.a used.

For most practical purposes $\bar{Y}_h - \bar{Y}_{h-1}$ will provide an adequate estimate
of the change between occasions $h - 1$ and h. If change is of particular
interest, however, formula 6.21.b may be used.* This latter estimate will
of course not agree exactly with $\bar{Y}_h - \bar{Y}_{h-1}$ and will therefore lead to apparent
inconsistencies in the summary of the results.

It sometimes happens that the sampling scheme, though broadly following
a partial replacement procedure, gives rise to some inequality of numbers of
units on the different occasions. This can be allowed for by substituting for
φ the value $\phi'$ given by

$$\phi' = \frac{n''}{\mu\,n_h}\,\phi \qquad (6.22.b)$$

where $n_h$ is the number of units on occasion h, and $n''$ is the number of units
not included on the previous occasion.


TABLE 6.22—PERCENTAGE SOLIDS-NOT-FAT : ADJUSTMENT OF SAMPLES TAKEN
ON SUCCESSIVE OCCASIONS

                              January     February    March       April       May         June
$\bar{y}_h$                   5  9.400    11 9.090    9  9.111    10 9.059    8  9.211    11 9.345
$\bar{y}_h'$                     —        4  9.288    8  9.122    8  9.086    3  9.060    7  9.326
$\bar{y}_h''$                    —        7  8.977    1  9.020    2  8.950    5  9.302    4  9.380
$\{\bar{y}_h\}$                  —           9.341       9.163       9.188       9.226       9.322
$\bar{Y}_h$                      9.400       9.122       9.151       9.152       9.262       9.338
$[\bar{y}_h']$                4  9.335    8  9.072    8  9.025    3  8.947    7  9.267       —
$\bar{Y}_h - [\bar{y}_h']$     + .065      + .050      + .126      + .205      - .005        —
$\phi'$                          —          .603        .084        .151        .472        .275
From differences                 9.400       9.353       9.403       9.464       9.577       9.636


* To obtain the most accurate possible estimate of change the information from
occasions prior to $h - 1$ would have to be taken into account. The problem has been
investigated by Mr. H. D. Patterson.
Example 6.22


Similar data on percentage solids-not-fat to those of Example 6.21 are
given in abstract form in Table 6.22 for the months January-June. Only
the 3-5 months of lactation are included. Obtain estimates of the mean
percentage in successive months.


The table shows for each month the overall mean $\bar{y}_h$, the mean of cows
occurring in the previous month $\bar{y}_h'$, and the mean of new cows $\bar{y}_h''$. The
numbers of cows on which these means are based are also shown. The mean
for the month $h - 1$ of cows occurring in months h and $h - 1$ is shown
in the line $[\bar{y}_h']$ in the column for month $h - 1$. Thus 9.335 and 9.288 are
the means for January and February of the four cows occurring in both these
months.

Summation of the sums of squares and products of deviations of pairs
of entries for successive months from January to December gives an overall
value for r of 0.811, so that $r^2$ is 0.657. The similar calculation of the
correlation between months two apart gives $r'$ equal to 0.746. The assumption
that $r'$ equals $r^2$ therefore somewhat underweights the information obtainable
from occasions two apart.

The average value of μ over a long period will be 1/3, but considerable
fluctuations in numbers occur from month to month. For μ = 1/3 the value
of φ for occasions subsequent to the second is

$$\phi = \frac{-(1 - 0.657) + \sqrt{(1 - 0.657)\{1 - 0.657\,(1 - 4 \cdot 2/3 \cdot 1/3)\}}}{2 \times 2/3 \times 0.657} = 0.252$$

Hence for March

$$\phi' = \frac{1}{\frac{1}{3} \times 9} \times 0.252 = 0.084$$

The values for the other months are shown in Table 6.22. For February, the
second occasion, formula 6.21.a may be used. This is of the same
form as formula 6.22.a, and gives

$$\phi' = \frac{7\,(1 - 0.657/5)}{4 + 7\,(1 - 0.657/5)} = 0.603$$

Had the value for μ = 1/3 been calculated, and corrected by means of formula
6.22.b, we should have obtained

$$\phi = 1 - \frac{2/3}{1 - 0.657/9} = 0.281$$
$$\phi' = \frac{7}{\frac{1}{3} \times 11} \times 0.281 = 0.536$$

which does not differ greatly from the correct value. Equally φ differs little
from the value 0.252, obtained above for subsequent occasions.
The remainder of the calculations follow a standard pattern. The quantity
$\{\bar{y}_h\}$, equal to $\bar{y}_h' + r\,(\bar{Y}_{h-1} - [\bar{y}_{h-1}'])$, is calculated, and the weighted mean
of $\bar{y}_h''$ and $\{\bar{y}_h\}$ taken, with weights equal to $\phi'$ and $1 - \phi'$. Thus for
February

$$\{\bar{y}_h\} = 9.288 + 0.811 \times (+ 0.065) = 9.341$$
$$\bar{Y}_h = 0.603 \times 8.977 + 0.397 \times 9.341 = 9.122$$

The overall estimates $\bar{y}_h$ and the estimates from differences $\bar{y}_h' - [\bar{y}_{h-1}']$
are shown for comparison. The differences show a tendency to cumulative
errors, which is to be expected even with close correlation.

It will be seen that once a value for r has been determined, and provision
has been made to abstract the means $\bar{y}_h'$, $\bar{y}_h''$, and $[\bar{y}_{h-1}']$, the calculations
are very simple, and can easily be undertaken for large-scale surveys, even
when a number of different quantities require estimation.
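
The month-by-month adjustment of Table 6.22 lends itself to a simple loop. The Python sketch below is an illustration under stated assumptions: the function and variable names are invented here, r = 0.811 and μ = 1/3 are taken from the example, the weight for February is obtained from formula 6.21.a, and the weights for later months from the limiting value corrected by formula 6.22.b. Unrounded weights are used, so the results may differ from the table by a unit in the last place.

```python
from math import sqrt


def limiting_phi(r, mu):
    """Limiting weight of the new-unit mean for a replaced fraction mu."""
    lam = 1 - mu
    r2 = r * r
    root = sqrt((1 - r2) * (1 - r2 * (1 - 4 * lam * mu)))
    return (-(1 - r2) + root) / (2 * lam * r2)


def phi_second_occasion(n_common, n_new, mu_previous, r):
    """Weight of the new-unit mean on the second occasion (from formula 6.21.a)."""
    new_part = n_new * (1 - mu_previous * r * r)
    return new_part / (n_common + new_part)


r, mu = 0.811, 1 / 3
phi = limiting_phi(r, mu)

# month: (n', mean of common cows, n'', mean of new cows,
#         mean in the previous month of the common cows), from Table 6.22
months = {
    "February": (4, 9.288, 7, 8.977, 9.335),
    "March": (8, 9.122, 1, 9.020, 9.072),
    "April": (8, 9.086, 2, 8.950, 9.025),
    "May": (3, 9.060, 5, 9.302, 8.947),
    "June": (7, 9.326, 4, 9.380, 9.267),
}

adjusted = {"January": 9.400}      # first occasion: the overall mean
previous_estimate = 9.400
previous_n = 5                     # cows included in January

for name, (n_common, y_common, n_new, y_new, earlier_common) in months.items():
    n_h = n_common + n_new
    if name == "February":
        phi_h = phi_second_occasion(n_common, n_new, 1 - n_common / previous_n, r)
    else:
        phi_h = phi * n_new / (mu * n_h)                      # formula 6.22.b
    carried = y_common + r * (previous_estimate - earlier_common)
    previous_estimate = phi_h * y_new + (1 - phi_h) * carried  # formula 6.22.a
    adjusted[name] = previous_estimate
    previous_n = n_h

for name, value in adjusted.items():
    print(f"{name:9s} {value:.3f}")
```

Once r has been fixed, only the three means and the two counts for each month are needed, which is what makes the method practicable on a large scale.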
CHAPTER 7
ESTIMATION OF THE SAMPLING ERROR 


7.1 Sampling errors of a random sample 


The general principles involved in the estimation of sampling errors can 
best be made clear by considering the error of a random sample drawn from 
a large population. 

Consider first a sample consisting of a single unit. Let the mean of the
population be $\bar{y}$, and let the deviations of the individual values from this mean
be $z_1, z_2, z_3, \ldots$, so that $z_1 = y_1 - \bar{y}$, $z_2 = y_2 - \bar{y}$, etc. Then the
actual error in the estimate of the mean from a sample of one unit will be $z_r$,
where r is the selected unit.

The mean of all the $z$'s is zero, and therefore the average of the errors
of the estimates from a large number of samples of one unit (having regard to
the signs of the errors) will approximate to zero. This is equivalent to saying
there is no bias in the estimate.

In order to obtain a measure of the magnitude of the expected error we
must therefore obtain some form of average of the $z$'s which does not take
account of sign. One simple measure which might be taken is the average
of all the $z$'s without regard to sign, but an alternative measure, which has a
number of statistical advantages, is provided by the mean of the squares of
all the $z$'s. This is termed the mean square deviation of $y$ or the variance of $y$,
and is denoted by $V(y)$ or $\sigma^2$, and its estimate by $s^2$. The square
root of this variance is termed the standard deviation $\sigma$ of a single unit.

In the same way, if a sample contains a number of units we may define
the sampling variance of an unbiased estimate, say $\bar{y}$, derived from such a
sample as the mean of the squares of the actual errors of a large number of
samples of the same size. This variance, or an estimate of it, will be denoted
by $V(\bar{y})$. The square root of this variance is generally termed the
standard error of the estimate, and will be denoted by S.E. $(\bar{y})$, whether
true or estimated. The term standard error is also sometimes applied
to the standard deviation of a single unit, particularly when the deviations are
in the nature of errors of observation.

The standard error of the estimate of the population mean derived from
a sample of one unit is therefore equal to the standard deviation of a single
unit. If a sample of two units $r$ and $s$ is taken, the actual error of the estimate
of the population mean will be (with $\bar{\mathbf{y}}$ the population mean)

$$\bar{y} - \bar{\mathbf{y}} = \tfrac{1}{2}(y_r + y_s) - \bar{\mathbf{y}} = \tfrac{1}{2}(z_r + z_s)$$

The standard error of the estimate will therefore be given by the square root
of the average value of

$$\tfrac{1}{4}(z_r + z_s)^2 = \tfrac{1}{4}(z_r^2 + z_s^2 + 2 z_r z_s)$$

If the population is large the average value of $z_r z_s$ is zero, as can be seen if
we consider a series of samples having the same first unit $r$ and different second
units $s$. The average values of $z_r^2$ and $z_s^2$ are both $\sigma^2$. Hence the average
value of the above expression is $\tfrac{1}{2}\sigma^2$. Consequently the standard error of the
estimate is $\sigma/\sqrt{2}$.

It will be noted that the above argument does not depend on the form of
the distribution of the $z$'s; there is no need, for example, for positive and
negative deviations to be equally frequent. It does, however, require that
each unit of the sample shall be randomly and independently selected. If,
for instance, there were a tendency to select a second unit with a deviation
similar to the first unit the average value of $z_r z_s$ would not be zero.
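
The two-unit argument can be checked by direct enumeration. The following is only an illustrative sketch: the list `values` is a made-up, deliberately skewed set of numbers, and independent draws (every draw may pick any unit) are used to mimic a large population. Exact fractions are used so that the comparison with $\sigma^2/n$ is not blurred by rounding.

```python
from fractions import Fraction
from itertools import product

# Hypothetical values, chosen to be skewed; they are not from the text.
values = [1, 2, 2, 3, 12]
N = len(values)
pop_mean = Fraction(sum(values), N)

# sigma^2: the mean of the squared deviations z = y - population mean.
sigma2 = sum((v - pop_mean) ** 2 for v in values) / N


def mean_square_error(n):
    """Average squared error of the sample mean over every ordered sample
    of n independent draws from `values`."""
    samples = list(product(values, repeat=n))
    return sum((Fraction(sum(s), n) - pop_mean) ** 2 for s in samples) / len(samples)


for n in (1, 2, 3):
    # Each line prints n, the enumerated variance of the mean, and sigma^2 / n.
    print(n, mean_square_error(n), sigma2 / n)
```

The two printed quantities agree for each n even though positive and negative deviations are far from equally frequent in these values.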

The argument can easily be extended to a sample of $n$ units, for which the
variance and standard error of the estimate will be found to be


$$V(\bar{y}) = \frac{\sigma^2}{n}, \qquad \text{S.E.}(\bar{y}) = \frac{\sigma}{\sqrt{n}} \tag{7.1.a}$$


We thus have the important general result that the standard error of the 
estimate of the mean of a large population from a random sample is inversely 
proportional to the square root of the number of units in the sample. 

The standard error of the estimate of the total follows immediately from
the rule that if $l$ is any multiplier the standard error of $ly$ is equal to $l$ times
the standard error of $y$, provided $l$ is not subject to sampling variation. Thus


$$\text{S.E.}(Y) = \text{S.E.}(g n \bar{y}) = g n \{\text{S.E.}(\bar{y})\} = g \sigma \sqrt{n} \tag{7.1.b}$$


Although $S(y)$ is not itself an estimate it is often convenient to consider its
sampling variance or standard error. From the above rule


$$V\{S(y)\} = n^2 V(\bar{y}), \qquad \text{S.E.}\{S(y)\} = \sigma\sqrt{n} \tag{7.1.c}$$


The standard error of an estimate can be expressed as a percentage of the 
population value of the estimated quantity. This form of expression is useful, 
as the percentage standard error is unaffected by the units in which the estimate 
is expressed, and the percentage standard error of the mean, of the total of 
the sample, and of the estimate of the population total are all equal. Similarly 
the standard deviation of a single unit can be expressed as a percentage of the 
mean value of a single unit. This is sometimes termed the coefficient of variation. 
Denoting it by $\sigma\,\%$, we have, in a large population,


$$\text{S.E.}\,\%\,(\bar{y}) = \text{S.E.}\,\%\,\{S(y)\} = \text{S.E.}\,\%\,(Y) = (\sigma\,\%)/\sqrt{n}$$


Thus in a population with a percentage standard deviation per unit of 20 per 
cent., the percentage standard error of the estimate of the population mean 
or total from a sample of 100 will be 2 per cent., that from a sample of 400 will 
be 1 per cent., etc. 

In order to estimate the standard error of the mean or total in numerical
terms an estimate of the value of $\sigma$ will be required. This can be obtained
from the deviations $y - \bar{y}$ of the numerical values of the selected units from
their mean. $y - \bar{y}$ will be nearly, though not exactly, equal to $z$, and to a
first approximation an estimate of $\sigma^2$ will therefore be provided by the mean
square deviation $S(y - \bar{y})^2/n$. Actually the sum of the squares of the
deviations from the sample mean is always less than the sum of the squares of
the deviations from the population mean, as can be seen from the identity
(in which $\bar{\mathbf{y}}$ denotes the population mean)


$$S(y - \bar{y})^2 = S(y - \bar{\mathbf{y}})^2 - n(\bar{y} - \bar{\mathbf{y}})^2$$

The average value of the first term on the right-hand side is $n\sigma^2$, and the
average value of the second term is $\sigma^2$, since $\bar{y} - \bar{\mathbf{y}}$ is the error in the estimate
of the mean, whose mean square is $\sigma^2/n$. Thus $S(y - \bar{y})^2$ has an average value of $(n - 1)\sigma^2$, and
consequently an estimate of $\sigma^2$ is given by


$$s^2 = \frac{1}{n - 1}\, S(y - \bar{y})^2 \tag{7.1.d}$$


The divisor $n - 1$ is technically known as the number of degrees of freedom
associated with the estimate of error, and is equal to the number of independent
comparisons that can be made between $n$ values.
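
That the divisor $n - 1$, and not $n$, gives an estimate whose average is $\sigma^2$ can be verified by enumeration. The sketch below is an illustration under stated assumptions: `values` is a hypothetical set of numbers, draws are independent (mimicking a large population), and exact fractions are used throughout.

```python
from fractions import Fraction
from itertools import product

# Hypothetical values; they are not taken from the text.
values = [4, 7, 7, 10, 22]
N = len(values)
pop_mean = Fraction(sum(values), N)
sigma2 = sum((v - pop_mean) ** 2 for v in values) / N


def average_sum_of_squares(n):
    """Average of S(y - ybar)^2 over every ordered sample of n independent draws."""
    total = Fraction(0)
    count = 0
    for sample in product(values, repeat=n):
        sample_mean = Fraction(sum(sample), n)
        total += sum((y - sample_mean) ** 2 for y in sample)
        count += 1
    return total / count


n = 3
avg_ss = average_sum_of_squares(n)
print(avg_ss, (n - 1) * sigma2)          # the two values coincide
print(avg_ss / (n - 1) == sigma2)        # divisor n - 1 reproduces sigma^2 on average
print(avg_ss / n == sigma2)              # divisor n does not
```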

The calculation of the sum of the squares of the deviations $S(y - \bar{y})^2$
is best done from the sum of the squares of the values themselves. By this
procedure the calculation and squaring of the individual deviations, which
often involve fractional values, is avoided. One of the expressions


$$S(y - \bar{y})^2 = S(y^2) - n\bar{y}^2$$
$$\phantom{S(y - \bar{y})^2} = S(y^2) - \bar{y}\, S(y)$$
$$\phantom{S(y - \bar{y})^2} = S(y^2) - \frac{1}{n}\{S(y)\}^2$$


is used. The last term of each of the three expressions is usually termed
"the correction for the mean." In calculating it from one of the first two
expressions, $\bar{y}$ must be taken to at least as many significant figures as are
required in the correction. For this reason the last expression is often the
most convenient.

Sometimes it pays to use some convenient round number $y_0$ as a working
mean, in which case we have

$$S(y - \bar{y})^2 = S(y - y_0)^2 - n(\bar{y} - y_0)^2$$

etc.

If a calculating machine is available the individual squares should not be
written down; the sum of squares can be obtained directly by squaring the
numbers successively without clearing the machine.

The calculation of the sum of squares from grouped data is illustrated
in Examples 7.1.b and 7.2.b.
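
The equivalence of the direct form, the form using the correction for the mean, and the working-mean form can be seen in a short sketch. The data in `ys` and the working mean `y0` are hypothetical, and the function names are chosen here only for illustration.

```python
import math


def ss_direct(ys):
    """S(y - ybar)^2 computed from the individual deviations."""
    mean = sum(ys) / len(ys)
    return sum((y - mean) ** 2 for y in ys)


def ss_from_totals(ys):
    """S(y^2) - {S(y)}^2 / n, the form using the correction for the mean."""
    return sum(y * y for y in ys) - sum(ys) ** 2 / len(ys)


def ss_working_mean(ys, y0):
    """S(y - y0)^2 - n (ybar - y0)^2 for a convenient round working mean y0."""
    n = len(ys)
    mean = sum(ys) / n
    return sum((y - y0) ** 2 for y in ys) - n * (mean - y0) ** 2


ys = [12.0, 9.5, 10.0, 7.5, 11.0]   # hypothetical values
y0 = 9.0                            # hypothetical working mean

print(ss_direct(ys), ss_from_totals(ys), ss_working_mean(ys, y0))
```

In floating-point arithmetic the second form can lose accuracy when the mean is large relative to the spread, which is the modern counterpart of the remark about significant figures in the correction.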


Example 7.1.a


Estimate the standard error of the estimate of the mean of the population
(assumed large) of which the values of Table 6.4 are a sample.

The computations are as follows:


$n = 20$                      $S(y^2) = 1959.12$
$S(y) = 193.8$                $\bar{y}\,S(y) = 1877.92$
$\bar{y} = 9.69000$           $S(y - \bar{y})^2 = 81.20$

$$s^2 = \frac{81.20}{19} = 4.274$$

$$\text{S.E.}(\bar{y}) = \sqrt{\frac{4.274}{20}} = \pm 0.46$$
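
The arithmetic of this example can be retraced from the two totals alone. This is a sketch, not the original working: it takes $n = 20$ (the sample size consistent with the 19 degrees of freedom used for these data later in the chapter), $S(y) = 193.8$ and $S(y^2) = 1959.12$, and applies formulae 7.1.d and 7.1.a.

```python
import math

n = 20             # sample size (19 degrees of freedom)
sum_y = 193.8      # S(y)
sum_y2 = 1959.12   # S(y^2)

mean = sum_y / n
correction = sum_y ** 2 / n          # the correction for the mean
ss = sum_y2 - correction             # S(y - ybar)^2
s2 = ss / (n - 1)                    # formula 7.1.d
se_mean = math.sqrt(s2 / n)          # formula 7.1.a with s in place of sigma

print(round(mean, 2), round(correction, 2), round(ss, 2))
print(round(s2, 3), round(se_mean, 2))
```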


Example 7.1.b 


Table 7.1 gives the distribution of family income in a sample of 162 white 
families in Norfolk—Portsmouth, Virginia. Calculate the mean income of the 
sample and the sampling standard error of this mean. 


TABLE 7.1. ANNUAL NET INCOME OF A SAMPLE OF 162 WHITE FAMILIES
IN NORFOLK-PORTSMOUTH, VIRGINIA, 1934-6


                                        Calculation              Calculation by
                                                              successive summation
  Net       No. of    Working      Total      Sum of squares    Total     Sum of
 income    families    units     (2) × (3)      (2) × (3)²                squares
   $                                            = (3) × (4)
  (1)        (2)        (3)         (4)            (5)            (6)       (7)

   600-       10        − 3        − 30             90             10        10
   900-       23        − 2        − 46             92             33        43
 1,200-       40        − 1        − 40             40             73       116
 1,500-       32          0           0              0            116
 1,800-       28        + 1        + 28             28             57       105
 2,100-       20        + 2        + 40             80             29        48
 2,400-        4        + 3        + 12             36              9        19
 2,700-        2        + 4        +  8             32              5        10
 3,000-        1        + 5        +  5             25              3         5
 3,300-        2        + 6        + 12             72              2         2

             162                   − 11            495            105       358

With grouped data of this type it is best to use the group interval as the
working unit and the central value of one of the central groups as the working
mean. The group $1500-1799 (central value, $1649.5, since the data were
rounded off to the nearest dollar before grouping) has been chosen. The
calculation of the total and sum of squares in these units is shown in columns 4
and 5 of the table. The mean of the sample in working units is therefore
− 11/162, i.e. − 0.06790, and in the proper units is

1649.5 − 0.06790 × 300 = 1629.1

The sum of squares of the deviations in the working units is

495 − 0.06790 × 11 = 494.25


and in the proper units is therefore 494.25 × 300², i.e. 44,482,000. Hence,
dividing by 161, s² = 276,290, and the sampling standard error of the mean
income is ± √(276,290/162) = ± 41.3.

If no calculating machine, or only an adding machine, is available the 
alternative form of calculation shown in columns 6 and 7 may be preferred. 
Column 6 is formed from column 2 by successive summation from the ends. 
Column 7 is similarly formed from column 6. Note the check 
73 + 57+ 32 = 162 for column 6, and the checks of the final values for 
column 7 from the totals of column 6. The total in working units is then 
given by the difference of the totals of the two halves of column 6, i.e. by 
105 − 116 = − 11. The sum of squares is obtained by doubling the total
of column 7 and deducting the sum of the totals of the two halves of column 6,
i.e. by 2 × 358 − 105 − 116 = 495.
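
Both forms of the grouped calculation can be sketched as follows, using the family counts and working units of Table 7.1. The variable names are illustrative only; `accumulate` performs the successive summation from each end towards the working-mean group.

```python
import math
from itertools import accumulate

counts = [10, 23, 40, 32, 28, 20, 4, 2, 1, 2]   # column 2 of Table 7.1
units = [-3, -2, -1, 0, 1, 2, 3, 4, 5, 6]       # column 3 (working units)
interval = 300                                   # group interval in dollars
working_mean = 1649.5                            # centre of the $1500-1799 group

n = sum(counts)

# Direct calculation (columns 4 and 5).
total = sum(c * u for c, u in zip(counts, units))
sum_sq = sum(c * u * u for c, u in zip(counts, units))

# Successive summation (columns 6 and 7), working inwards from each end.
zero = units.index(0)
upper = counts[:zero]               # groups below the working mean, from the top
lower = counts[zero + 1:][::-1]     # groups above it, from the bottom
col6_upper, col6_lower = list(accumulate(upper)), list(accumulate(lower))
col7_upper, col7_lower = list(accumulate(col6_upper)), list(accumulate(col6_lower))

total_alt = sum(col6_lower) - sum(col6_upper)
sum_sq_alt = (2 * (sum(col7_upper) + sum(col7_lower))
              - sum(col6_upper) - sum(col6_lower))

# Conversion to proper units.
mean = working_mean + interval * total / n
ss = (sum_sq - total ** 2 / n) * interval ** 2
s2 = ss / (n - 1)
se_mean = math.sqrt(s2 / n)

print(total, sum_sq, total_alt, sum_sq_alt)
print(round(mean, 1), round(se_mean, 1))
```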


7.2 Sampling from a finite population 


The above theory requires modification in two respects if the population
is not large. In the first place it is best to define $\sigma^2$ as

$$\sigma^2 = \frac{1}{N - 1}\, S_p (y - \bar{\mathbf{y}})^2$$

where $S_p$ denotes summation over the whole population and $\bar{\mathbf{y}}$ is the
population mean. This is equivalent
to regarding the population as itself a random sample from an infinitely large
population with variance $\sigma^2$. With this definition the estimate
$s^2$ stands without modification, whereas the alternative definition with
divisor $N$ introduces the factor $(N - 1)/N$ into the formulae. The first
definition is therefore used in the discussion of the errors of sampling from
finite populations.

In the second place the formula for the standard error of the estimate of
the mean or other estimate requires modification by the introduction of the
factor $\sqrt{1 - f}$, or more strictly (1 − f). Thus we have


$$\text{S.E.}(\bar{y}) = \sigma\sqrt{\frac{1 - f}{n}} \tag{7.2}$$

That the introduction of some factor of this kind is necessary is obvious, since
if the whole population is included in the sample (f = 1) the sampling error 
will be zero. ‘The actual factor can be deduced by an extension of the algebraic 
analysis given above. 
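
The factor can also be confirmed numerically for a small finite population by listing every possible sample drawn without replacement. This is a sketch under stated assumptions: `values` is a hypothetical population of six units, $\sigma^2$ is defined with the divisor $N - 1$, and $f = n/N$.

```python
from fractions import Fraction
from itertools import combinations

# Hypothetical finite population; not taken from the text.
values = [3, 7, 8, 12, 20, 22]
N = len(values)
pop_mean = Fraction(sum(values), N)

# sigma^2 defined with divisor N - 1.
sigma2 = sum((v - pop_mean) ** 2 for v in values) / (N - 1)


def variance_of_mean(n):
    """Mean square error of the sample mean over all samples of n distinct units."""
    samples = list(combinations(values, n))
    return sum((Fraction(sum(s), n) - pop_mean) ** 2 for s in samples) / len(samples)


for n in range(1, N + 1):
    f = Fraction(n, N)
    # Enumerated variance next to sigma^2 (1 - f) / n; for n = N both are zero.
    print(n, variance_of_mean(n), sigma2 * (1 - f) / n)
```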

It should be noted that the factor $\sqrt{1 - f}$ should not be introduced
when testing the difference between the means of two sampled populations 
to see whether, for example, they are subject to different causal agencies. 
In this case we are concerned to determine whether there is a real and consistent 
difference running through all the units of the two populations: in other 
words, we wish to test whether the two samples can reasonably be regarded 
as random samples from a single infinitely large parent population, or whether 
they have to be regarded as samples of two different parent populations. 


Example 7.2.a

Estimate the standard errors applicable to the estimates obtained in
Example 6.4.b.


From Example 7.1.a, $s^2 = 4.274$, and consequently


$$\text{S.E.}(\bar{y}) = \sqrt{\frac{4.274\,(1 - \frac{1}{25})}{20}} = \pm 0.453$$


and since $Y = 500\,\bar{y}$,


S.E. (Y) = 500 × 0.453 = ± 226
S.E. (Y') = 507 × 0.453 = ± 230


Example 7.2.b


Estimate the sampling error of the wheat acreage from the random sample
of 125 farms of Table 6.6.a.

We find:

$n = 125$, $S(y) = 2301$, $\bar{y} = 18.4080$

$S(y^2) = 207{,}261$, $S(y - \bar{y})^2 = 164{,}904$, $s^2 = 164{,}904/124 = 1329.9$

$$\text{S.E.}(\bar{y}) = \sqrt{1329.9\,(1 - \tfrac{1}{20})/125} = \pm 3.18$$

$$\text{S.E.}(Y) = 20\sqrt{125 \times 1329.9\,(1 - \tfrac{1}{20})} = 20 \times 125 \times 3.18 = \pm 7950$$

The calculation of the mean and of the sum of squares of deviations may
alternatively be carried out by grouping the data. The groups should be so
chosen that the distribution within any group containing a substantial number
of values is reasonably even. A grouping interval of 10 acres is here convenient,
but because of the large number of zeros these must be included in a separate
group. The grouped data and the calculation of the total and sum of squares are
shown in Table 7.2. The calculations are carried out in terms of working
values in units of the grouping interval, and a working mean of 24.5 (the mean
of group 4) is taken. The total is obtained from column 4 and the sum of
squares from column 5. The mean in terms of the grouping interval is therefore

(− 209.75 + 136)/125 = − 73.75/125 = − 0.590

and in terms of the proper units is

$\bar{y}$ = 24.5 − 0.590 × 10 = 18.60


Similarly

$s^2$ = (1719.2 − 73.75 × 0.590) × 10²/124 = 167,570/124 = 1351.4

The rest of the computations proceed as before.


TABLE 7.2. CALCULATION OF THE MEAN AND VARIANCE FROM GROUPED DATA
(WHEAT ACREAGES OF TABLE 6.6.a)


 Acres    Number    Working value    (2) × (3)     (2) × (3)²
  (1)       (2)          (3)            (4)           (5)

    0        80        − 2.45        − 196.00        480.2
    1-        5        − 1.95        −   9.75         19.0
   10-        4        − 1           −   4             4
   20-       11          0               0             0
                       (negative subtotal: − 209.75)
   30-        2        + 1           +   2             2
   40-        2        + 2           +   4             8
   50-        4        + 3           +  12            36
   60-        4        + 4           +  16            64
   70-        4        + 5           +  20           100
   80-        3        + 6           +  18           108
   90-        1        + 7           +   7            49
  100-        3        + 8           +  24           192
  110-        1        + 9           +   9            81
  260-        1        + 24          +  24           576

            125                      + 136          1719.2


If the data are fully tabulated, grouping is scarcely worth while for so
small a body of data, even when a calculating machine is not available,
especially when, as here, the existence of zeros complicates the grouping while
simplifying the direct calculation of the sum of squares. With material on
punched cards, however, the data can be most easily and compactly presented
in grouped form, and the advantages of this form of computation therefore
become much greater, particularly when the number of values is large. 
Grouping also enables the form of the distribution to be much more easily 
comprehended. In the present data the relatively large number of values 
between 20 and 30, and the single very high value of 265, are immediately 
apparent. 


7.3 The normal law of error 


The above analysis shows that it is possible, from the numerical values 
of the selected sampling units, to estimate the standard error of the estimate 
of the mean of the population. This gives us a measure of the average error 
to be expected. The analysis has not, however, given us any indication of 
the frequency with which errors of different magnitudes may be expected to 
occur. 

It is a matter of common observation that in most material which is subject 
to quantitative variation large deviations tend to occur less frequently than 
do small deviations. In much material, also, positive and negative deviations 
occur with about equal frequency. The exact distribution of the deviations 
of individual sampling units will, of course, vary considerably in different 
types of material, but it is a fortunate circumstance that, over a wide range 
of distributions of the parent material, the errors to which estimates such as 
the mean, total, etc., are subject are distributed approximately according to 
what is known as the normal law of error, i.e. in a normal distribution. Other 
things being equal, the larger the sample on which the estimate is based, the 
more closely is the law followed. If the deviations of the original material 
are normally distributed, the errors of the estimate of the mean, etc., will 
conform exactly to a normal distribution. 

In a normal distribution the frequency with which deviations within the
infinitesimal range $z$ to $z + dz$ may be expected to occur is given by the
expression:


$$f(z)\,dz = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-z^2/2\sigma^2}\, dz$$


where $\sigma$ is the standard deviation, and $e$ is the base of Napierian logarithms,
2.71828 approximately.

Fig. 7.3 shows normal distributions with standard deviations $\sigma = 1$ and
$\sigma = 2$. The vertical scale represents the frequency with which deviations
within a range of 0.1 of $z$ occur per 1000 values. Thus the ordinate at $z = 0$
for $\sigma = 1$ is 39.9, which indicates that on the average 39.9 values per 1000
may be expected to have deviates having values between − 0.05 and + 0.05.
The value 39.9 can be derived from the above formula by putting $\sigma = 1$,
$z = 0$, $dz = 0.1$ and multiplying by 1000.
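
The ordinate quoted above can be recomputed directly from the density. The function names below are illustrative, and the frequency is approximated, as in the text, by the density multiplied by the width of the range.

```python
import math


def normal_density(z, sigma):
    """f(z) for a normal distribution with mean zero and standard deviation sigma."""
    return math.exp(-z * z / (2 * sigma * sigma)) / (sigma * math.sqrt(2 * math.pi))


def frequency_per_1000(z, sigma, dz=0.1):
    """Approximate number of values per 1000 falling in a range of width dz about z."""
    return 1000 * normal_density(z, sigma) * dz


print(round(frequency_per_1000(0.0, 1.0), 1))   # ordinate at z = 0 for sigma = 1
print(round(frequency_per_1000(0.0, 2.0), 1))   # the flatter curve, sigma = 2
```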

From the figure it will be seen that positive and negative deviations of a 
given magnitude occur with equal frequency, and that large deviations are 
much less frequent than small ones. We are, however, in general not so much
concerned with the frequency of a deviation of any particular magnitude, as
with the frequency with which deviations greater than a given magnitude
may be expected to occur. These latter frequencies, which correspond to
areas in Fig. 7.3, are shown in Table A.2 at the end of the book, for various
values of $z/\sigma$. The area for $z/\sigma = 1$ is shaded for both curves.

FIG. 7.3. NORMAL FREQUENCY DISTRIBUTIONS WITH STANDARD DEVIATIONS σ = 1
AND σ = 2. THE FLATTER CURVE IS THAT FOR σ = 2.

The shaded areas represent the total frequencies of the values for which the actual
deviations are greater than the standard deviation.

From Table A.2 it will be seen that 61.7 per cent. of all values have a
deviation or error (positive or negative) greater than one-half the standard
deviation or standard error, 31.7 per cent. of all values have a deviation greater
than the standard deviation, but only 4.6 per cent. have a deviation greater than
twice the standard deviation, and only 0.27 per cent. will have a deviation
greater than three times the standard deviation. Consequently, if we know
that an estimate is subject to the normal law of error and has a given standard
error, we can assign probable limits of error to this estimate. If limits of plus
or minus twice the standard error are taken then in only 4.6 per cent. of the
cases will the actual error lie outside these limits.
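
The percentages read from Table A.2 can be reproduced from the complementary error function, since for a normal deviate the proportion of values lying more than $k$ standard deviations from the mean (in either direction) is $\operatorname{erfc}(k/\sqrt{2})$. The helper name below is an illustrative choice.

```python
import math


def two_sided_tail(k):
    """Proportion of a normal distribution lying more than k standard
    deviations from the mean, positive or negative."""
    return math.erfc(k / math.sqrt(2))


for k in (0.5, 1.0, 2.0, 3.0):
    # Prints k and the percentage of values outside plus or minus k sigma.
    print(k, round(100 * two_sided_tail(k), 2))
```

Halving these values gives the one-sided percentages used when deviations in only one direction are considered.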

An alternative form of this statement, utilizing fiducial probability, which
has certain logical advantages, is as follows.

If the true value of the mean is equal to the estimate minus twice the
standard deviation (the lower limit of error) then a value of the estimate as
high as, or higher than, that actually observed will be obtained in only 2.3 per
cent. of all samples. (The values of Table A.2 are divided by 2 since deviations
in only one direction are involved.) Similarly, if the true value of the mean
is equal to the estimate plus twice the standard error (the upper limit of error),
a value of the estimate as low as, or lower than, that actually observed will be
obtained in only 2.3 per cent. of all samples. For limits of plus or minus
once the standard error the corresponding percentages are 16 per cent. These 
closer limits are useful as indicating the region within which, or in the fairly 
close neighbourhood of which, the mean is likely to lie. 

As pointed out above, we usually only have an estimate of the standard 
error, which is itself subject to error, the accuracy being dependent on the 
number of degrees of freedom, and statements in the above form are therefore 
not exact. 

The effect of inaccuracy in the estimate of error on fiducial statements can 
be allowed for by the use of what is known as the t distribution, instead of the 
normal distribution. In general, however, inaccuracies due to paucity of data 
are not sufficiently great for this to be necessary in census work. More important 
is the fact that with a number of types of sample frequently employed, e.g. 
systematic samples and stratified samples with one unit from each stratum, 
fully valid estimates of error are not available. The estimates of error actually 
obtained in such cases are usually overestimates of the sampling standard 
errors, and any exact fiducial statement is therefore impossible. 

In a random sample of $n$ ($n - 1$ degrees of freedom) from a normal
distribution with standard deviation $\sigma$ the standard error of $s$ is given by


$$\text{S.E.}(s) = \frac{\sigma}{\sqrt{2(n - 1)}} \tag{7.3}$$
If the material conforms approximately to the normal law of variation, an 
estimate based on 50 degrees of freedom will therefore determine the sampling 
error with a standard error of 10 per cent. and an estimate based on 200 degrees 
of freedom with a standard error of 5 per cent. If the material does not conform 
to the normal law the accuracy may be substantially less. 


Example 7.3 


Assign limits of error to the estimate of the mean of the population (assumed 
large) of the values of Table 6.4. (The values are actually a random sample 
from a normal distribution with mean 10 and standard deviation 2.) Show 
that they are distributed in the expected manner. Calculate also the standard 
error of the estimate of the standard deviation. 


The results obtained in Example 7.1.a indicate that the true value of the
mean is not likely to lie far outside the range 9.69 ± 0.46, i.e. 9.23 to 10.15,
and is fairly certain to lie within the range 9.69 ± 2 × 0.46, i.e. 8.77 to 10.61.

The distribution of the 20 values is given in Table 7.3. Integral values
have been allotted ½ and ½ to the two appropriate classes. Each interval in
this grouping is equal to 0.5σ. From Table A.2 the proportionate frequency
of observations with deviations greater than 0.5σ (positive or negative) is
0.6171, and consequently the expected frequencies of observations between
9 and 10 and between 10 and 11 are each 20 × ½ (1 − 0.6171) = 3.83.

TABLE 7.3. OBSERVED AND EXPECTED FREQUENCIES IN THE SAMPLE
OF TABLE 6.4 FROM A NORMAL DISTRIBUTION

Range      <6    6-7   7-8   8-9   9-10  10-11  11-12  12-13  13-14  14-15  >15   Total
Observed   0     1     2.5   4     2.5   6.5    1      0.5    1      1      0     20
Expected   0.45  0.88  1.84  3.00  3.83  3.83   3.00   1.84   0.88   0.33   0.12  20.00


Similarly the proportionate frequency of observations greater than ± 1.0σ
is 0.3173, and consequently the expected frequencies between 8 and 9 and
between 11 and 12 are each 20 × ½ (0.6171 − 0.3173) = 3.00. In this manner
all the expected frequencies shown in Table 7.3 can be calculated. The
observed frequencies conform satisfactorily to the expected frequencies.

From formula 7.3 the standard error of the estimate $s$ is

$$\text{S.E.}(s) = \frac{2}{\sqrt{2 \times 19}} = \pm 0.324$$

The actual value of $s$, 2.07, is therefore closer to the true value than will occur
on the average in samples of 20.
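
The expected frequencies and the standard error of $s$ in this example can be recomputed without tables. The sketch assumes, as stated in the example, a normal distribution with mean 10 and standard deviation 2 and a sample of 20; the class boundaries 6, 7, ..., 15 and the helper `normal_cdf` are written out here for illustration.

```python
import math

mean, sd, n = 10.0, 2.0, 20


def normal_cdf(x):
    """Proportion of the normal distribution (mean 10, s.d. 2) lying below x."""
    return 0.5 * math.erfc(-(x - mean) / (sd * math.sqrt(2)))


edges = list(range(6, 16))                      # class boundaries 6, 7, ..., 15
proportions = ([normal_cdf(edges[0])]
               + [normal_cdf(b) - normal_cdf(a) for a, b in zip(edges, edges[1:])]
               + [1 - normal_cdf(edges[-1])])
expected = [n * p for p in proportions]         # classes <6, 6-7, ..., 14-15, >15

se_s = sd / math.sqrt(2 * (n - 1))              # formula 7.3

print([round(e, 2) for e in expected])
print(round(se_s, 3))
```

The computed frequencies agree with the tabulated values to within rounding in the second decimal place.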


7.4 Qualitative variates 

From the procedure developed for random samples it will be seen that 
the estimation of the sampling errors of estimates derived from a quantitative 
variate can be divided into two distinct stages, the first being the estimation 
of the variability of the individual sampling units (or more strictly the part 
of the variability which contributes to sampling error), and the second the 
derivation of the standard errors of the estimates in terms of the variability 
of the individual sampling units.

The same principles hold when the variate under consideration is 
qualitative. In the case of a random sample, however, the variability of an 
attribute of the sampling units depends only on the proportion of units 
possessing the attribute in the population. Hence in random samples no 
estimate of the variability of the individual sampling units is required. 

For a random sample from a large population, if $q = 1 - p$, we have


$$V(p) = \frac{pq}{n}, \qquad \text{S.E.}(p) = \sqrt{\frac{pq}{n}}$$


If the population is finite and the sampling fraction $f$ is appreciable the formula
becomes, to all necessary accuracy,

$$\text{S.E.}(p) = \sqrt{\frac{pq(1 - f)}{n}}$$

The exact expression is obtained by replacing $(1 - f)$ by $(N - n)/(N - 1)$.

Similarly

$$V(U) = g^2\, npq\,(1 - f)$$

and hence

$$\text{S.E.}(U) = g\sqrt{npq(1 - f)} = N\{\text{S.E.}(p)\}$$

Fig. 7.4 shows the way in which the standard error of the estimated
percentage S.E. (100 p) varies with the percentage 100 p in samples of 100
and 1000 from a large population. The actual values of S.E. (100 p) are shown
by the full line, while the dotted line gives the percentage standard error
100 S.E. (100 p)/100 p.


FIG. 7.4. STANDARD ERRORS OF THE ESTIMATED PERCENTAGE OF UNITS HAVING A GIVEN
ATTRIBUTE, AND OF THE ESTIMATED NUMBER HAVING THE ATTRIBUTE, FOR DIFFERENT
PERCENTAGES OF UNITS HAVING THE GIVEN ATTRIBUTE IN THE POPULATION
(Horizontal axis: percentage in population, 100 p.)


The full line shows the actual standard error of the estimated percentage, and the
broken line the percentage standard error of the estimated number. This is equal
to the percentage standard error of the estimated percentage. The scales shown are
for samples of 100 and 1000. For a sample of 10,000 divide the values of the left-hand
scale by 10, etc.


The standard errors obtained with larger samples for which the sample
number is a power of 10 can also be read from the figure by dividing one of
the scales by the appropriate power of 10. Thus for a sample of 10,000 the
scale for the sample of 100 is divided by 10, since √(10,000/100) = 10.

The actual standard error has its maximum value at p = 0.5. At this
point the standard error of the estimated percentage with a sample of 100
is 5.0, and with a sample of 1000 is 1.58, i.e. if the true percentage is 50 per
cent. the value of the estimated percentage will usually lie between 40 per cent.
and 60 per cent. with a sample of 100, and between 47 per cent. and 53 per
cent. with a sample of 1000. Expressed in percentage terms the standard
errors and limits at this point are double the above values. As the percentage
in the population decreases from 50 per cent. the actual standard error of the
estimated percentage also decreases, but the percentage standard error continues
to increase. With 100 p = 20 per cent. the actual standard error with a sample
of 100 is 4.0 and the percentage standard error 20 per cent.; with 100 p = 5 per
cent. they are 2.2 and 44 per cent. respectively. Thus, while quite a small sample
serves to verify that the proportion in a population possessing a given attribute
is small, the determination with any accuracy of the actual number possessing
the attribute requires a relatively large sample when the proportion is small.
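
The figures quoted in this paragraph follow from S.E. (p) = √{pq(1 − f)/n}. A minimal sketch, with function names chosen here for illustration and the sampling fraction `f` defaulting to zero for a large population:

```python
import math


def se_of_percentage(pct, n, f=0.0):
    """Actual standard error of an estimated percentage (in percentage points)."""
    p = pct / 100
    return 100 * math.sqrt(p * (1 - p) * (1 - f) / n)


def percentage_standard_error(pct, n, f=0.0):
    """Standard error expressed as a percentage of the estimated percentage."""
    return 100 * se_of_percentage(pct, n, f) / pct


for pct, n in [(50, 100), (50, 1000), (20, 100), (5, 100)]:
    print(pct, n,
          round(se_of_percentage(pct, n), 2),
          round(percentage_standard_error(pct, n), 1))
```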

In estimating the sampling error the proportion in the population $p$ can
be replaced by its estimate from the sample, i.e. by the proportion in the
sample. This results in a certain amount of error in the estimate of variability,
since the proportion in the sample will not in general be exactly equal to that
in the population, but in large samples, such as are commonly met with in
census work, this is not likely to be of much importance. Exact treatment
of the problem is possible by use of Table VIII.1 of Statistical Tables for
Biological, Agricultural and Medical Research.

It must be clearly recognized that the above formulae hold only when the
units of which the proportion possessing a given attribute is being assessed
are themselves the sampling units, and the sample is a random one from the
whole population. In a stratified random sample the formulae apply to each
stratum taken separately. In other cases, e.g. multi-stage sampling, and all
types of sampling with supplementary information, the variability no longer
depends only on the proportions in the population or strata. Thus, for
example, the formulae are not applicable to the proportion of farms growing
a given crop when two-stage sampling by administrative districts, and by
farms within selected districts, has been carried out. Equally they are not
applicable to the proportion of individuals of a given race in a human population,
when the sampling has been by households; since the whole of a household
is usually of the same race, the sampling error will clearly be greater than if a
random sample of individuals had been taken.

When the above formulae do not hold, the variability of the individual
sampling units must be assessed in the same manner as with a quantitative
variate, scoring the qualitative variate 1 or 0. (See Example 7.8.b.)


Example 7.4

Estimate the sampling errors of the estimates of Example 6.4.a.

We have p = 0.0738, q = 1 − 0.0738 = 0.9262. Hence

$$\text{S.E.}(p) = \sqrt{\frac{0.0738 \times 0.9262\,(1 - \frac{1}{50})}{8491}} = \pm 0.00281$$

S.E. (percentage defective) = 100 × 0.00281 = ± 0.281
S.E. (total number defective) = S.E. (U) = 50 × 8491 × 0.00281 = ± 1190

Thus the percentage defective, 7.38 per cent., has a standard error of
± 0.28, which implies, taking limits of plus or minus twice the standard error,
that the true percentage defective probably lies between 6.8 per cent. and
7.9 per cent. Similarly the number defective probably lies between 29,000
and 33,700. Note that the standard error expressed as a percentage of the
percentage defective or number defective, i.e. what is ordinarily called the
percentage standard error, is

$$\text{S.E.}\,\%\,(p) = \text{S.E.}\,\%\,(U) = \frac{0.00281}{0.0738} \times 100 = \frac{1190}{31{,}350} \times 100 = \pm 3.8 \text{ per cent.}$$


These standard errors are likely to be slight overestimates, since the sample
was in fact systematic.
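
The working of this example can be retraced as follows. The sketch takes p = 0.0738 together with the sample number 8491 and raising factor 50 that appear in the expression for S.E. (U), and assumes the sampling fraction is the reciprocal of the raising factor.

```python
import math

p = 0.0738          # proportion defective in the sample
n = 8491            # number of units in the sample
g = 50              # raising factor; sampling fraction assumed to be 1/g
f = 1 / g

q = 1 - p
se_p = math.sqrt(p * q * (1 - f) / n)       # S.E. (p) with the finite-population factor
se_percentage = 100 * se_p                  # S.E. of the percentage defective
se_total = g * n * se_p                     # S.E. (U) = N x S.E. (p), with N = g n
percentage_se = 100 * se_p / p              # percentage standard error

print(round(se_p, 5), round(se_percentage, 3))
print(round(se_total), round(percentage_se, 1))
```

The standard error of the total comes out a few units above the quoted 1190 because the text multiplies by the rounded value 0.00281.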


7.5 Standard errors of functions of estimates 


If we have a number of estimates $y_1, y_2, y_3, \ldots$ with sampling errors
which are independent, the sampling variances being $V(y_1), V(y_2), V(y_3),
\ldots$, and we form a linear function of the $y$'s:


$$L = l_1 y_1 + l_2 y_2 + l_3 y_3 + \ldots$$


where the $l$'s are any multipliers whose values are not influenced by the
sampling, the sampling variance of $L$ is given by


$$V(L) = l_1^2 V(y_1) + l_2^2 V(y_2) + l_3^2 V(y_3) + \ldots \tag{7.5.a}$$

The condition of independence is important. The sampling errors of two
estimates will be independent if the estimates are derived from sets of values
which are themselves independent. Estimates derived from samples of
different populations, or from different strata of the same population, are
consequently independent, as are estimates derived from different samples of
a large population. Estimates derived from two different variates belonging
to the same sampling units are not in general independent, since such variates
are likely to be correlated, high values of the one being associated with high
(or low) values of the other in the same sampling units.

A number of important simple formulae are derivable from the above
general formula.

The standard error of a multiple of an estimate is the same multiple of the
standard error of the estimate:


$$V(l y_1) = l^2 V(y_1) \tag{7.5.b}$$
$$\text{S.E.}(l y_1) = l\,\text{S.E.}(y_1)$$

This formula has already been used in Section 7.1.

The standard error of the difference of two independent estimates is the
square root of the sum of the squares of the standard errors of the estimates:

$$V(y_1 - y_2) = V(y_1) + V(y_2) \tag{7.5.c}$$
$$\text{S.E.}(y_1 - y_2) = \sqrt{\{\text{S.E.}(y_1)\}^2 + \{\text{S.E.}(y_2)\}^2}$$

The standard error of the sum of a number of independent estimates is the
square root of the sum of the squares of the standard errors of the estimates:

$$V(y_1 + y_2 + y_3 + \ldots) = V(y_1) + V(y_2) + V(y_3) + \ldots \tag{7.5.d}$$
$$\text{S.E.}(y_1 + y_2 + y_3 + \ldots) = \sqrt{\{\text{S.E.}(y_1)\}^2 + \{\text{S.E.}(y_2)\}^2 + \{\text{S.E.}(y_3)\}^2 + \ldots}$$

which may be expressed by the rule that "variances are additive."
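
The general rule for a linear function of independent estimates, and its special cases for a multiple, a difference and a sum, can be sketched in a few lines. The function name and the numerical standard errors are hypothetical, chosen only to make the arithmetic easy to follow.

```python
import math


def variance_of_linear_function(multipliers, variances):
    """V(L) = sum of l_i^2 V(y_i), valid only for independent estimates."""
    return sum(l * l * v for l, v in zip(multipliers, variances))


# Hypothetical standard errors of two independent estimates.
se1, se2 = 3.0, 4.0
v1, v2 = se1 ** 2, se2 ** 2

se_multiple = math.sqrt(variance_of_linear_function([5], [v1]))             # 5 * y1
se_difference = math.sqrt(variance_of_linear_function([1, -1], [v1, v2]))   # y1 - y2
se_sum = math.sqrt(variance_of_linear_function([1, 1], [v1, v2]))           # y1 + y2

print(se_multiple, se_difference, se_sum)
```

The difference and the sum have the same standard error, since the multipliers enter only through their squares.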

The standard error of the estimate of the mean of a large population can 
also be derived from the formula. 

Weighted means are a type of linear function which occurs frequently in 
statistics. The general form of a weighted mean is 


$$\bar{y}_w = \frac{w_1 y_1 + w_2 y_2 + \dots}{w_1 + w_2 + \dots}$$

where the $w$'s are the weights. Knowing the variances of the $y$'s, the variance 
of $\bar{y}_w$ can be calculated, provided the $y$'s are independent. Two cases are of 
frequent occurrence. 

(1) $V(y_1) = V(y_2) = \dots = V(y)$. 
We then have 

$$V(\bar{y}_w) = \frac{S(w^2)}{\{S(w)\}^2} V(y) \tag{7.5.e}$$

(2) $V(y_1) = \lambda / w_1$, etc., where $\lambda$ is a constant. 
We then have 

$$V(\bar{y}_w) = \frac{S(w^2 \lambda / w)}{\{S(w)\}^2} = \frac{\lambda}{S(w)} \tag{7.5.f}$$


This is the form of weighted mean which is used when we wish to obtain 
the most accurate combined estimate from a number of independent estimates 
of the same quantity whose relative variances are known. The weights are 
taken equal (or proportional) to the reciprocals of the variances, and the variance 
of the weighted mean is given by the reciprocal of the sum of the weights 
(or a multiple of this reciprocal). 
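
The two cases can be checked numerically against the general rule for a linear function. The sketch below is illustrative only: the function names are assumptions, and the three variances are hypothetical figures chosen to make the arithmetic easy to follow.

```python
def var_weighted_mean(weights, variances):
    """General case: V = S(w^2 V(y)) / {S(w)}^2 for independent y's."""
    total = sum(weights)
    return sum(w * w * v for w, v in zip(weights, variances)) / total ** 2

def var_equal_variances(weights, var_y):
    """Case (1), formula 7.5.e: all the y's have the same variance."""
    return sum(w * w for w in weights) / sum(weights) ** 2 * var_y

def var_inverse_weights(weights, lam):
    """Case (2), formula 7.5.f: V(y_i) = lam / w_i."""
    return lam / sum(weights)

# Hypothetical variances of three independent estimates of the same quantity.
variances = [4.0, 1.0, 0.5]
lam = 1.0
optimal_weights = [lam / v for v in variances]   # reciprocals of the variances

v_optimal = var_weighted_mean(optimal_weights, variances)
v_unweighted = var_weighted_mean([1, 1, 1], variances)
```

With weights equal to the reciprocals of the variances the general rule reduces to the reciprocal of the sum of the weights, and the result is smaller than that of the unweighted mean.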

A further type of weighted mean is that in which the weights w are in the 
nature of supplementary information, the quantities y and w both being 
determined from the individual sampling units, with the variances of the y’s 
related in some unknown manner to the $w$'s, and $\bar{y}_w = S(wy)/S(w)$. 

In order to obtain an unbiased estimate of $V(\bar{y}_w)$, whatever the variance 
law, the squares of the deviations of the $y$ from $\bar{y}_w$ must be weighted in proportion 
to $w^2$ before summation. For a random sample, if 


$$Q = S\{w^2 (y - \bar{y}_w)^2\}$$


and 
we have T 
V (Jw) -Cp (7.5.8) 
It may also be noted that if the variance of $y$ for given $w$ can be regarded 
as constant over the range of $w$, and there is also no variation in the mean 
value of $y$ for given $w$ over the range other than that ascribable to random 
variation in $y$, the efficient estimate of the variance of $y$ is given by the ordinary 
formula 

$$V(y) = S(y - \bar{y})^2/(n - 1) \tag{7.5.h}$$

and formula 7.5.e may be used to estimate $V(\bar{y}_w)$, with the introduction of 
the factor $(1 - f)$. If the variance of $y$ for given $w$ is inversely proportional 
to $w$ then the efficient estimate of the variance of $y$ is given by 

$$Q' = S\{w (y - \bar{y}_w)^2\} = S(w y^2) - \bar{y}_w S(wy) \tag{7.5.i}$$
$$s_{Q'}^2 = Q'/(n - 1)$$

$s_{Q'}^2$ is an estimate of $\lambda$, and formula 7.5.f can be used for estimating $V(\bar{y}_w)$, 
with the introduction of the factor $(1 - f)$. Either of these estimates will 
be biased if the true variance law is different from that assumed or the other 
condition does not hold. They should therefore not be used without careful 
consideration. 

The mean ratio $\bar{r}$ used in the ratio method of estimation is an example 
of a weighted mean of the above type, since $\bar{r} = S(y)/S(x) = S(xr)/S(x)$, 
and we therefore substitute $r$ for $y$ and $x$ for $w$. This case is discussed in more 
detail in Sections 7.8-7.11, which deal with the estimation of errors in the 
ratio method in both random and stratified samples. Normally formula 7.5.g 
will be used to estimate $V(\bar{r})$, but under certain circumstances formulae 7.5.h 
and 7.5.e may be employed. 

The approximate formulae for the standard errors of the product and the 
ratio of two estimates whose sampling errors are independent may also be 
noted. These are given by 

$$V(y_1 y_2) = y_2^2 V(y_1) + y_1^2 V(y_2) \tag{7.5.j}$$
$$V(y_1 / y_2) = \frac{y_1^2}{y_2^2} \left\{ \frac{V(y_1)}{y_1^2} + \frac{V(y_2)}{y_2^2} \right\}$$

These formulae are only satisfactory if $V(y_1)$ and $V(y_2)$ are small relative to 
$y_1^2$ and $y_2^2$ respectively. 


If the estimates $y_1, y_2, y_3, \dots$ are not independent the concept of covariance 
must be introduced. The covariance between two estimates is the mean 
product deviation, and is estimated in exactly the same manner as is the 
variance of each of the estimates, with the exception that the sum of squares 
of the deviations of a single variate is replaced by the sum of products of the 
deviations of the two variates. If the covariance between $y_1$ and $y_2$ is denoted 
by $\text{cov}(y_1 y_2)$ the additional terms 

$$+ \, 2 l_1 l_2 \, \text{cov}(y_1 y_2) + 2 l_1 l_3 \, \text{cov}(y_1 y_3) + 2 l_2 l_3 \, \text{cov}(y_2 y_3) + \dots$$

must be introduced into the formula for $V(L)$. This gives the additional 
term $-2 \, \text{cov}(y_1 y_2)$ in the formula for $V(y_1 - y_2)$. The corresponding 
additional term in $V(y_1 y_2)$ is $+ 2 y_1 y_2 \, \text{cov}(y_1 y_2)$, and that in $V(y_1/y_2)$ is 
$-2 \, \text{cov}(y_1 y_2)/y_1 y_2$ within the bracket. 
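
The covariance terms can be handled compactly by summing $l_i l_j \, \text{cov}(y_i y_j)$ over all pairs, the variances being the diagonal entries. The following is a sketch under that convention; the function names and all numerical values are hypothetical, and the product and ratio functions use the usual approximate forms with the covariance terms described above.

```python
def var_linear_cov(multipliers, cov):
    """V(L) for correlated estimates; cov[i][j] is cov(y_i y_j), cov[i][i] is V(y_i).

    Each off-diagonal pair is visited twice, which supplies the factor 2.
    """
    k = len(multipliers)
    return sum(multipliers[i] * multipliers[j] * cov[i][j]
               for i in range(k) for j in range(k))

def var_product(y1, y2, v1, v2, c12=0.0):
    """Approximate variance of the product y1 * y2."""
    return y2 ** 2 * v1 + y1 ** 2 * v2 + 2 * y1 * y2 * c12

def var_ratio(y1, y2, v1, v2, c12=0.0):
    """Approximate variance of the ratio y1 / y2."""
    return (y1 / y2) ** 2 * (v1 / y1 ** 2 + v2 / y2 ** 2 - 2 * c12 / (y1 * y2))

# Hypothetical values: V(y1) = 4, V(y2) = 9, cov(y1 y2) = 3.
cov = [[4.0, 3.0], [3.0, 9.0]]
v_diff = var_linear_cov([1, -1], cov)   # 4 + 9 - 2 * 3
v_sum = var_linear_cov([1, 1], cov)     # 4 + 9 + 2 * 3
```

A positive covariance therefore reduces the variance of a difference and increases that of a sum.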

If $y_1, y_2, y_3, \dots$ are derived from different variates belonging to the same 
sampling units, e.g. measurements of different characters, the variance of any 
linear function $L$ can, if desired, be estimated directly by calculating a value 
$L$ for each sampling unit separately and estimating $V(L)$ from these values 
in the manner appropriate to a single variate. This obviates the calculation 
of the variances and covariances of the individual variates. The same method 
can be followed with products and ratios, subject to the same limitations as 
those given above for formulae 7.5.e and 7.5.f. If the errors of a number 
of functions are required, however, it is best to calculate the variances and 
covariances (Example 7.8.b). 
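
That the direct method agrees with the variance and covariance method can be seen on a small example. The sketch below uses hypothetical per-unit values of two variates and works with the variances per unit; the helper `sample_cov` is an assumption introduced for illustration.

```python
from statistics import mean, variance

def sample_cov(a, b):
    """Sum of products of deviations divided by n - 1."""
    ma, mb = mean(a), mean(b)
    return sum((u - ma) * (v - mb) for u, v in zip(a, b)) / (len(a) - 1)

# Hypothetical values of two variates on the same four sampling units.
x = [2.0, 4.0, 5.0, 7.0]
y = [1.0, 3.0, 2.0, 6.0]
l1, l2 = 1.0, -2.0

# Direct method: compute L for each unit, then treat it as a single variate.
per_unit_L = [l1 * u + l2 * v for u, v in zip(x, y)]
direct = variance(per_unit_L)

# Variance and covariance method.
via_cov = l1 ** 2 * variance(x) + l2 ** 2 * variance(y) + 2 * l1 * l2 * sample_cov(x, y)

# Regression and correlation coefficients from the same quantities.
b = sample_cov(x, y) / variance(x)
r = sample_cov(x, y) / (variance(x) * variance(y)) ** 0.5
```

The two routes give the same value for a linear function; the direct route is simply less work when only one function is wanted.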

The regression and correlation coefficients can be expressed in terms of 
the variances and covariance. We have the relations $b = \text{cov}(xy)/V(x)$, 
and $r = \text{cov}(xy)/\sqrt{V(x) \cdot V(y)}$. 

In the more complicated types of sampling, discussed later, the estimation 
of covariance is again exactly parallel to the estimation of the corresponding 
variances, the squares being replaced by products wherever they occur. 


Example 7.5 


Calculate standard errors for the various estimates of the regional and 
varietal differences between the yields of potatoes given in Tables 5.23.c, 
5.23.d, 5.23.e and 5.24, given that the variance of the yield per acre of any 
one variety in any one region is 4.22, and that the standard deviation is 
therefore ± 2.05. 


The standard errors of the regional-varietal means of Table 5.23.b are 
obtained by dividing the above standard deviation by the square roots of the 
numbers of fields. Thus for Majestic in Scotland the standard error is 
$2.05/\sqrt{37} = \pm 0.34$. The standard errors are shown in Table 7.5. 


TABLE 7.5—POTATO SURVEY : STANDARD ERRORS OF REGIONAL-VARIETAL MEANS 


                 Scotland     North     E. Midlands     South      West 
Majestic          ± 0.34     ± 0.24       ± 0.20       ± 0.20     ± 0.24 
King Edward       ± 0.32     ± 0.55       ± 0.22       ± 0.25     ± 0.31 
Great Scot        ± 0.48     ± 0.55         —          ± 0.84     ± 0.48 
Arran Banner      ± 0.72     ± 0.33         —          ± 0.68     ± 0.38 
Kerr's Pink       ± 0.25     ± 0.34         —            —        ± 0.57 

These standard errors enable the differences between the individual means 
to be examined more critically. The difference between Scotland and the 
Northern region for Arran Banner, for example, is at first sight anomalous, 
being $-0.12$. The standard error of this difference is $\sqrt{0.72^2 + 0.33^2}$ 
$= \pm 0.79$. This difference, therefore, does not conflict very seriously with 
the other differences. 

On the other hand the difference between this difference and the largest 
positive difference, that for King Edward, is $+1.94 - (-0.12) = +2.06$. 
This quantity has a standard error of $\sqrt{0.32^2 + 0.55^2 + 0.72^2 + 0.33^2}$ 
$= \pm 1.02$. It might therefore be judged unlikely, on this evidence alone, 
that the difference has arisen by chance, since Table A.2 shows that a difference 
of 2.0 times its standard error would arise by chance in less than 1 in 20 times. 
In statistical terminology the difference is significant at the 1 in 20 level of 
significance. This conclusion, however, is subject to the qualification that 
we have here picked the two extreme differences out of 10 possible pairs. A 
combined test* of all 5 differences shows that they are not exceptionally variable. 
A more comprehensive test of the differences of the whole table confirms 
this verdict. It may be noted, however, that Arran Banner is only $ as common 
in Scotland as in the Northern region, whereas King Edward is 3 times as 
common in Scotland as in the Northern region. The observed differences 
are therefore in the direction that would be expected if the varieties were 
grown in the regions to which they were most suited. 

The standard errors of the means of Table 5.23.c may be calculated in 
a similar manner, that for mean (a) of Scotland for example being 
$\tfrac{1}{5}\sqrt{0.34^2 + 0.32^2 + 0.48^2 + 0.72^2 + 0.25^2} = \pm 0.20$. The corresponding 
value for the Northern region is ± 0.19. The standard error of the estimated 
difference $8.55 - 7.46 = +1.09$ is therefore $\sqrt{0.20^2 + 0.19^2} = \pm 0.28$. 
Similarly the standard error for the corresponding difference of the means (b), 
$8.98 - 7.37 = +1.61$, is ± 0.38. 

Table 5.23.d provides an example of a weighted mean with the weights 
so chosen that the most accurate combined estimate is obtained. Formula 7.5.f 
is therefore appropriate, and $\lambda$ represents the variance of a single field, i.e. 
$\lambda = 4.22$. Hence the variance of the weighted mean difference $= 4.22/74$ 
$= 0.0570$, and the standard error is therefore ± 0.24. 

The relative efficiency of the above estimates of the differences may be 
assessed from the ratio of the reciprocals of the variances (Section 8.1). 
Assigning a value of 100 to the weighted mean, the relative efficiencies of 
means (a) and (b) are 73.5 and 39.9.† 
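
The arithmetic of this example is easily retraced. The sketch below recomputes the quoted standard errors from the figures given in the text (standard deviation 2.05, 37 fields of Majestic in Scotland, the entries of Table 7.5, and the divisor 74); the variable names are illustrative.

```python
import math

sd = 2.05                                   # standard deviation per field

# Standard error of a regional-varietal mean: s.d. / sqrt(number of fields).
se_majestic_scotland = sd / math.sqrt(37)

# Difference between Scotland and the North for Arran Banner.
se_arran_difference = math.sqrt(0.72 ** 2 + 0.33 ** 2)

# Difference between the King Edward and Arran Banner differences.
se_difference_of_differences = math.sqrt(0.32 ** 2 + 0.55 ** 2 + 0.72 ** 2 + 0.33 ** 2)

# Unweighted mean (a) of the five Scottish varietal means.
scotland = [0.34, 0.32, 0.48, 0.72, 0.25]
se_mean_a_scotland = math.sqrt(sum(s ** 2 for s in scotland)) / 5

# Weighted mean difference, formula 7.5.f with lambda = 4.22 and S(w) = 74.
se_weighted_difference = math.sqrt(4.22 / 74)
```

Rounded to two decimal places these reproduce the values ± 0.34, ± 0.79, ± 1.02, ± 0.20 and ± 0.24 quoted above.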

Finally we may evaluate the standard errors of Table 5.23.e. These 
cannot be evaluated exactly, as the pooling of regions is based on the assumption 
that there are no differences of any importance between these regions. In so 
far as this is not the case additional errors will be introduced, and the standard 
errors calculated on the assumption that there are no differences will therefore 
be underestimates of the true errors. 


* The weighted sum of squares of deviations gives $\chi^2 = 5.74$ with 4 degrees of 
freedom. 

† These values do not represent the true efficiencies, which are obtained by assigning 
the value of 100 to the most efficient possible method, here that of Table 5.24. 

The standard errors of the means of the pooled regions can be calculated 
from the numbers of fields on which each mean is based. Thus that for 
Majestic is $\sqrt{4.22/356} = \pm 0.11$. The standard errors of the Scottish 
means have already been given in Table 7.5. The standard error of the 
weighted mean for Majestic is therefore given by 


2 x 0:112 + 12 x 0°34? 
(7 x0 + ) =+00 


Similarly the standard errors for the other four varieties are found to be 
± 0.13, ± 0.29, ± 0.24 and ± 0.24. 

The standard errors of the estimates obtained in Table 5.24 can only 
be calculated exactly by inversion of the matrix of the simultaneous linear 
equations giving the least-squares solution. This requires a good deal of 
arithmetical work. The method is explained, for example, in Statistical 
Methods for Research Workers, Section 29. 

In material of this type, however, there will rarely be any need to determine 
the standard errors exactly. An upper limit to the standard error of any 
particular difference can be obtained by calculating the standard error of the 
estimate given by Method (3) of Section 5.23. A lower limit can be obtained 
by calculating what the standard error would be if there were no cross 
classification, and if the relevant variance per unit were that within sub-classes. 
The value of this latter standard error for the difference between Scotland and 
the Northern region, for example, is 


= li Ls ai) — 99 
2-054 | i74 + 777) = HO? 


The value of the standard error for Method (3) has already been found to be 
± 0.24. In this case close limits are set to the true standard error. 


7.6 Stratified random sample with possibly unequal variances within 
strata 


Since in a stratified sample differences between the sampling units in the 
different strata are eliminated from the sampling error, in estimating this error 
we require not the total variance of the sampling units over the whole population, 
but the variances of the sampling units within the different strata. 

A simple example will illustrate the difference. Suppose we have a large 
population of which 25 per cent. of the units have the value 8, 50 per cent. 
have the value 10, and 25 per cent. have the value 12. The mean of the 
population is 10, and 50 per cent. of the values have a deviation of 0 from the 
mean, the remaining 50 per cent. having a deviation of ± 2. The mean 
square deviation or total variance is therefore $0.5 \times 0^2 + 0.5 \times 2^2 = 2$. 
Suppose now the population is divided into two strata, the first containing all 
the units with value 8 and one-half the units with value 10, the second one- 
half the units with value 10 and all the units with value 12. The mean of 
the first stratum is 9, and all units in it have a deviation of ± 1. The mean 
square deviation or variance within the first stratum is therefore 1. The same 
holds for the second stratum. 
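
The figures of this illustration can be verified directly. In the sketch below each distribution is written as a list of (value, proportion) pairs; the function name is an illustrative assumption.

```python
def mean_and_variance(distribution):
    """Mean and mean square deviation of a list of (value, proportion) pairs."""
    m = sum(value * p for value, p in distribution)
    v = sum(p * (value - m) ** 2 for value, p in distribution)
    return m, v

population = [(8, 0.25), (10, 0.50), (12, 0.25)]
stratum_1 = [(8, 0.5), (10, 0.5)]     # all the 8's and half the 10's
stratum_2 = [(10, 0.5), (12, 0.5)]    # half the 10's and all the 12's

pop_mean, total_variance = mean_and_variance(population)
mean_1, within_1 = mean_and_variance(stratum_1)
mean_2, within_2 = mean_and_variance(stratum_2)

# Variance of the two (equally sized) strata means about the population mean.
between = 0.5 * (mean_1 - pop_mean) ** 2 + 0.5 * (mean_2 - pop_mean) ** 2
```

The total variance of 2 splits into a within-strata variance of 1 and a between-strata component of 1, and only the former enters the sampling error of a stratified sample.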

In this example the strata are of the same size and the within-strata variances 
are equal. Examples can easily be constructed in which this is not the case, 
but even if the variances are unequal an average within-strata variance can be 
calculated, and in a large population this will always be less than the total 
variance if there are differences between the strata means. 

If the sample numbers in all strata are sufficiently large for the within-strata 
variances per sampling unit to be separately estimated, the sampling variances 
of the means or totals of the individual strata can be estimated separately, and 
the sampling variance of the population mean or total, which is a linear function 
of these means or totals, can be obtained by the use of the formulae of 
Section 7.5. 

This method is valid even if there is inequality in the within-stratum 
variance per sampling unit from stratum to stratum, and is applicable to all 
types of stratification, including stratification with a variable sampling fraction 
and stratification after selection. 

In general it is best to build up the variance of the population estimate 
under consideration by calculating the variances of the component parts, and 
adding these variances, or the correct multiples of them, the same steps being 
followed as in the calculation of the estimate itself. We will therefore not 
give formulae for the variances of all the different estimates set out in Sections 
6.4 and 6.5, but will illustrate the derivation of such formulae by obtaining 
that for $V(\bar{y})$ in the case of a variable sampling fraction. 

We have $\bar{y} = \Sigma\{g_i S_i(y)\}/N$. If $\sigma_i^2$ is the variance within the $i$th stratum, 
$V\{S_i(y)\} = n_i \sigma_i^2 (1 - f_i)$, and hence by formula 7.5.a 

$$V(\bar{y}) = \Sigma\{g_i^2 n_i \sigma_i^2 (1 - f_i)\}/N^2 \tag{7.6.a}$$

For the estimate of $V(\bar{y})$ the $\sigma_i^2$ will be replaced by their estimates $s_i^2$. 

If all the sampling fractions are equal we have, since $N = gn$, 

$$V(\bar{y}) = (1 - f) \, \Sigma(n_i s_i^2)/n^2 \tag{7.6.b}$$
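
A short sketch may help to show how the general formula collapses to the simpler one when the sampling fractions are equal. The function names are assumptions, and the three strata below (with raising factor 20) are hypothetical figures used only for the check.

```python
def var_mean_variable_fraction(strata, N):
    """Formula 7.6.a; strata is a list of (g_i, n_i, s2_i, f_i)."""
    return sum(g * g * n_i * s2 * (1 - f) for g, n_i, s2, f in strata) / N ** 2

def var_mean_uniform_fraction(strata, f):
    """Formula 7.6.b; strata is a list of (n_i, s2_i), common fraction f."""
    n = sum(n_i for n_i, _ in strata)
    return (1 - f) * sum(n_i * s2 for n_i, s2 in strata) / n ** 2

# Hypothetical sample: uniform sampling fraction 1/20, raising factor 20.
f, g = 0.05, 20
sample = [(26, 1.9), (18, 11.2), (26, 162.1)]
N = g * sum(n_i for n_i, _ in sample)        # N = g n

v_a = var_mean_variable_fraction([(g, n_i, s2, f) for n_i, s2 in sample], N)
v_b = var_mean_uniform_fraction(sample, f)
```

With a common raising factor the factor $g^2/N^2$ becomes $1/n^2$, so the two functions return the same value.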

The examples which follow will illustrate the details of the methods to be 
followed in the cases that are ordinarily met with in practice. 

If we require the sampling errors of estimates applicable to domains of 
study which cut across the strata the situation is more complicated. If a dash 
is used to denote the domain in question, and if $s_i'^2$ is the estimated variance 
per unit between the units of this domain in stratum $i$, and the proportion 
of the selected units of stratum $i$ not in the domain is $q_i'$, so that $q_i' = (n_i - n_i')/n_i$, 
estimates of the variances of the total and mean of the domain are given by 

$$V(Y') = \Sigma \, g_i^2 n_i (1 - f_i) \{n_i' q_i' \bar{y}_i'^2 + (n_i' - 1) s_i'^2\}/(n_i - 1) \tag{7.6.c}$$
$$N'^2 V(\bar{y}') = \Sigma \, g_i^2 n_i (1 - f_i) \{n_i' q_i' (\bar{y}_i' - \bar{y}')^2 + (n_i' - 1) s_i'^2\}/(n_i - 1) \tag{7.6.d}$$

Consequently, in the case of a stratified sample with uniform sampling 
fraction, an approximate estimate of the sampling error of a domain mean 
will be obtained by treating the sample as if it were stratified for the domain 
in question, and not stratified for the strata which cut across the domain. 
(See Example 7.7.b.) 

In the case of a variable sampling fraction the above formulae must be 
used. In addition the variance of the difference of the means of two domains 
will be somewhat increased by covariance. It may be noted that, although 
the variances of estimates for such domains may be considerably increased by 
lack of control by stratification, a variable sampling fraction will still be 
advantageous in the appropriate circumstances. The optimal sampling fractions 
will be dependent on the estimates required, being approximately proportional 
to the square roots of $1/(n_i - 1)$ times the quantities in the curly brackets.* 
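
One way of seeing where the formula for a domain total comes from is to score every selected unit outside the domain as zero and apply the ordinary stratum formula to the resulting values. The sketch below checks this for a single stratum, assuming the contribution takes the form $g^2 n (1 - f)\{n' q' \bar{y}'^2 + (n' - 1) s'^2\}/(n - 1)$; the function names and the data are hypothetical.

```python
from statistics import mean, variance

def domain_term(y, in_domain, g, f):
    """One stratum's contribution to V(Y'), from the domain units only."""
    n = len(y)
    yd = [v for v, d in zip(y, in_domain) if d]
    n_d = len(yd)
    if n_d == 0:
        return 0.0
    q = (n - n_d) / n                       # proportion not in the domain
    ss_d = variance(yd) * (n_d - 1) if n_d > 1 else 0.0
    return g * g * n * (1 - f) * (n_d * q * mean(yd) ** 2 + ss_d) / (n - 1)

def zero_scored_term(y, in_domain, g, f):
    """Same contribution, scoring units outside the domain as zero."""
    z = [v if d else 0.0 for v, d in zip(y, in_domain)]
    return g * g * len(z) * (1 - f) * variance(z)

# Hypothetical stratum of six selected units, three of them in the domain.
y = [3.0, 1.0, 5.0, 8.0, 4.0, 2.0]
in_domain = [True, False, True, True, False, False]

a = domain_term(y, in_domain, g=20, f=0.05)
b = zero_scored_term(y, in_domain, g=20, f=0.05)
```

The agreement of the two values shows that the term in $q'$ represents the extra variation arising because the number of domain units falling in the sample is itself subject to chance.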


Example 7.6.a 

Estimate the sampling errors of the estimates of Example 6.5. 


The computations for acreages are set out in Table 7.6.a. The various 
steps are as follows. 

The sums of squares of deviations from each stratum mean, $S_i(y - \bar{y}_i)^2$, 
are first calculated from the sample values given in Table 6.5.a, using the 
method of Example 7.1.a. The estimated within-strata variances $s_i^2$ are 
then calculated by dividing by $n_i - 1$. Multiplying these by $(1 - f)/n_i$ gives 
the variances of the strata means $V(\bar{y}_i)$, and taking the square roots gives 
the sampling standard errors $\text{S.E.}(\bar{y}_i)$ of these means. The means themselves 
are tabulated in Table 6.5.b. 


TABLE 7.6.a—ESTIMATION OF SAMPLING ERRORS OF THE WHEAT ACREAGES OF 
EXAMPLE 6.5 


Size-group   $n_i$   $n_i - 1$   $S_i(y - \bar{y}_i)^2$    $s_i^2$    $V(\bar{y}_i)$   $\text{S.E.}(\bar{y}_i)$   $V\{S_i(y)\}$   $V(N_i \bar{y}_i)$ 
(acres) 
   1–          22       21              0 
   6–          26       25             47           1.9       0.069       ± 0.26             47            19,000 
  21–          18       17            191          11.2       0.593       ± 0.77                           76,000 
  51–          26       25          4,051         162.1       5.92        ± 2.43                        1,595,000 
 151–          20       19         13,899         731.5      34.75        ± 5.89                        5,560,000 
 301–          13       12         23,370       1,947.5     142.3         ± 11.93                      10,069,000 
 Total        125      119                                                               42,193        17,319,000 


* The sampling errors of contrasts between domains are more fully discussed in 
Sections 9.2-9.4. 
To obtain the sampling errors of the population estimates we must bear 
in mind the method by which these estimates were arrived at. The estimate $Y$ 
of the population total was obtained by multiplying the sample total by the 
raising factor. We therefore require the variance of the sample total 2011. 
This will be equal to the sum of the variances of the strata totals. These 
variances $V\{S_i(y)\}$ are obtained by multiplying $s_i^2$ by $n_i(1 - f)$. The 
square root of the sum of the variances is then taken and multiplied by 20. 
Thus 

$$\text{S.E.}(Y) = 20 \times \sqrt{42{,}193} = 20 \times 205.41 = \pm 4110$$

Similarly 

$$\text{S.E.}(\bar{y}) = 205.41/125 = \pm 1.64$$

In the case of $Y'$ each $V(\bar{y}_i)$ is multiplied by the square of the number of 
farms in the size-group, given in Table 6.5.b. Thus 

$$142.3 \times 266^2 = 10{,}069{,}000$$

Taking the square root of the sum, 

$$\text{S.E.}(Y') = \sqrt{17{,}319{,}000} = \pm 4160$$

and hence 

$$\text{S.E.}(\bar{y}') = 4160/2496 = \pm 1.67$$

It will be noted that although $Y'$ is slightly more accurate than $Y$ the 
standard error given by the above calculation is slightly greater. There are two 
reasons for this. In the first place in calculating $\text{S.E.}(Y)$ we have neglected 
the errors introduced by the use of a working sampling fraction. The average 
errors from this cause are equivalent to errors introduced by rounding off 
the numbers in the different strata of the sample to whole numbers. In the 
second place, use of the exact sampling fractions will result in slightly different 
contributions to the error variance from the different strata, which may result 
in raising or lowering the estimate of the standard error. 
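
The steps leading to $\text{S.E.}(Y)$ can be retraced from the sums of squares alone. The sketch below takes the numbers of farms and sums of squares of Table 7.6.a, with the working sampling fraction of 1/20 and raising factor 20 used in the example; small discrepancies from the printed figures arise because the table carries rounded intermediate values.

```python
import math

f = 1 / 20            # working sampling fraction
raising_factor = 20

# (n_i, sum of squares of deviations) for each size-group of Table 7.6.a.
strata = [(22, 0), (26, 47), (18, 191), (26, 4051), (20, 13899), (13, 23370)]

var_sample_total = 0.0
for n_i, ss in strata:
    s2 = ss / (n_i - 1)                       # within-stratum variance
    var_sample_total += s2 * n_i * (1 - f)    # variance of the stratum total

n = sum(n_i for n_i, _ in strata)             # 125 farms in the sample
se_sample_total = math.sqrt(var_sample_total)
se_Y = raising_factor * se_sample_total       # standard error of the total
se_ybar = se_sample_total / n                 # standard error of the mean
```

The results agree with the values ± 4110 and ± 1.64 to the accuracy quoted.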

The computations for number of farms growing wheat follow the same 
pattern, with the exception that the variance of each size-group total $V(u_i)$ 
is estimated from the proportion of farms growing wheat in that size-group. 
The calculations for $\text{S.E.}(U)$ are given in Table 7.6.b. For the largest size- 
group, for example, 

$$V(u_i) = 13 \times 0.84615 \times 0.15385 \times 19/20 = 1.608$$

We then have 

$$\text{S.E.}(U) = 20 \times \sqrt{12.83} = \pm 71.64$$


TABLE 7.6.b—ESTIMATION OF THE SAMPLING ERROR OF THE NUMBER OF FARMS 
GROWING WHEAT FROM EXAMPLE 6.5 


Size-group 
(acres)       $n_i$    $u_i$     $p_i$      $V(u_i)$ 

   1–5          22       0      0            0 
   6–20         26       1      0.03846      0.913 
  21–50         18       5      0.27778      3.431 
  51–150        26      21      0.80769      3.837 
 151–300        20      16      0.80000      3.040 
 301–           13      11      0.84615      1.608 
                                            12.829 


If the sampling errors of the different strata means are not required, the 
above calculations can be simplified slightly by omitting the factor $(1 - f)$ 
till the final stage of the computation. 
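
The entries of Table 7.6.b follow from the numbers of farms alone. A brief sketch, using the figures of the table and the form $n_i p_i (1 - p_i)(1 - f)$ shown above for the largest size-group:

```python
import math

f = 1 / 20
raising_factor = 20

# (n_i, u_i): farms in the sample, and farms growing wheat, per size-group.
groups = [(22, 0), (26, 1), (18, 5), (26, 21), (20, 16), (13, 11)]

variances = []
for n_i, u_i in groups:
    p = u_i / n_i                                  # proportion growing wheat
    variances.append(n_i * p * (1 - p) * (1 - f))  # V(u_i)

total_variance = sum(variances)
se_U = raising_factor * math.sqrt(total_variance)
```

The total is 12.83 and the standard error is about 71.6, in agreement with the table to within rounding.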


Example 7.6.b 

Estimate the sampling errors of the estimate of wheat acreage and number 
of farms growing wheat obtained by stratification after selection of the random 
sample of Example 6.6. 


The computations follow the same lines as those for S.E. (Y’) in Example 
7.6.a. The values obtained are : 
$\text{S.E.}(Y') = \pm 4820$ acres 
$\text{S.E.}(U') = \pm 75.2$ farms 


The value for S.E. (Y’) is slightly greater than that for the similar estimate 
of Example 7.6.a. The difference, however, is not a precise measure of the 
relative accuracy of stratification before and after selection. In the first place, 
since different samples are involved, there are differences in the estimates of 
the within-strata variances. In particular, the estimate of the variance for 
size-group 301– is much greater in the random sample because of the one 
high value 265. These differences are merely errors of estimation and are 
not a reflection of the relative accuracy of the two samples. On the other hand, 
the two largest size-groups, which have the highest variances, happen by 
chance to have more than the proportionate number of farms in the random 
sample, and this particular random sample will therefore tend to give a more 
accurate estimate with stratification than will a stratified sample. On the 
average, however, random samples stratified after selection will give slightly 
less accurate values than stratified samples. 


7.7 Pooled estimate of error: the analysis of variance 


In a stratified sample with equal sampling fractions all the $f_i$ are equal. 
If in addition the within-strata variances $\sigma_i^2$ are equal, by putting $\Sigma(n_i) = n$ 
we obtain the simplified formula : 

$$\text{S.E.}(\bar{y}) = \sigma_s \sqrt{\frac{1 - f}{n}}$$

This is the same as the formula for a random sample, with the exception that 
$\sigma$ is replaced by $\sigma_s$, the common within-strata standard deviation. 

The most accurate estimate of $\sigma_s$ will be obtained if the estimates of $\sigma_i^2$ 
from the various strata are weighted according to the numbers of degrees of 
freedom on which they are based. This is equivalent to adding the sums of 
squares of deviations from the strata means and dividing by the sum of the 
associated degrees of freedom, i.e. by $n - t$. 

It is worth noting here an identity which relates the above sum of squares 
to the sum of squares of deviations from the general mean. This is 


$$\Sigma S_i(y - \bar{y}_i)^2 + \Sigma n_i (\bar{y}_i - \bar{y})^2 = S(y - \bar{y})^2$$

This is easily verified if we recognize that 

$$\Sigma n_i (\bar{y}_i - \bar{y})^2 = \Sigma\{\bar{y}_i S_i(y)\} - \bar{y} S(y)$$

The first term on the right-hand side is the sum of the products of the means 
and totals of the separate strata, i.e. the "corrections for the means" for the 
separate strata, and the second term is the "correction for the general mean." 
The arithmetical computations can be conveniently arranged in the form 


of what is known as the analysis of variance. This is shown schematically in 
Table 7.7.a. 


TABLE 7.7.a—ANALYSIS OF VARIANCE BETWEEN AND WITHIN STRATA 


                     Degrees of     Sum of                                      Mean 
                     freedom        squares                                     square 
Between strata .     $t - 1$        $\Sigma \bar{y}_i S_i(y) - \bar{y} S(y)$    $A$ 
Within strata  .     $n - t$        $\Sigma S_i(y - \bar{y}_i)^2$               $B = s_s^2$ 
Whole sample   .     $n - 1$        $S(y^2) - \bar{y} S(y)$                     $C = s^2$ 


The most convenient form of computation, at least when the numbers 
in the different strata are small, is to calculate the sums of squares for the 
whole sample and between strata, and obtain the sum of squares within strata 
by subtraction. The mean squares are then obtained by division by the degrees 
of freedom. Only the mean square within strata, $s_s^2$, is required for the present 
purpose. The mean square for the whole sample approximates closely to an 
estimate s? of the variance per unit that would result from random sampling 
of the whole population. The interpretation of the mean square between 
strata, A, is discussed in Section 8.10. 
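
The computation just described, in which the within-strata sum of squares is obtained by subtraction, can be sketched as follows. The function name and the small data set are illustrative assumptions; the data are chosen so that the arithmetic can be followed by hand.

```python
def analysis_of_variance(strata):
    """Return (degrees of freedom, sum of squares, mean square) for each line."""
    all_y = [y for stratum in strata for y in stratum]
    n, t = len(all_y), len(strata)
    correction = sum(all_y) / n * sum(all_y)          # correction for the general mean
    between_ss = sum(sum(s) / len(s) * sum(s) for s in strata) - correction
    total_ss = sum(y * y for y in all_y) - correction
    within_ss = total_ss - between_ss                 # by subtraction
    return {
        "between": (t - 1, between_ss, between_ss / (t - 1)),
        "within": (n - t, within_ss, within_ss / (n - t)),
        "total": (n - 1, total_ss, total_ss / (n - 1)),
    }

# Hypothetical sample from two strata with means 9 and 11.
strata = [[8, 8, 10, 10], [10, 10, 12, 12]]
table = analysis_of_variance(strata)

# Direct check of the within-strata sum of squares.
direct_within = sum((y - sum(s) / len(s)) ** 2 for s in strata for y in s)
```

Here the total sum of squares, 16, divides equally into 8 between strata and 8 within strata, and the within-strata mean square is 8/6.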

A further simplification is possible when each of the strata contains only 
two sampling units. In this case the sum of squares within strata can be 
calculated directly from the differences of the y’s of the pairs of units within 
each stratum. If these differences are denoted by d, the sum of squares will 
be $\tfrac{1}{2} S(d^2)$, and consequently, since there are $t$ differences each contributing 
one degree of freedom, 

$$s_s^2 = \tfrac{1}{2} S(d^2)/t$$
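
The short cut for strata of two units can be checked against the ordinary computation from deviations. The pairs below are hypothetical, and the function names are assumptions made for the illustration.

```python
def within_ms_from_differences(pairs):
    """Within-strata mean square from the differences d of each pair."""
    t = len(pairs)
    return 0.5 * sum((a - b) ** 2 for a, b in pairs) / t

def within_ms_from_deviations(pairs):
    """Same quantity from deviations about each pair mean; n - t = t here."""
    ss = 0.0
    for a, b in pairs:
        m = (a + b) / 2
        ss += (a - m) ** 2 + (b - m) ** 2
    return ss / len(pairs)

pairs = [(3.0, 5.0), (10.0, 7.0), (6.0, 6.0)]   # hypothetical strata of two units
ms_short = within_ms_from_differences(pairs)
ms_long = within_ms_from_deviations(pairs)
```

Both routes give 6.5/3 for these figures.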
The analysis of variance has many applications, not only in sampling but 
in other fields of statistics. As its name implies, it provides a way of determining 
the different components of variance to which a given type of material is 
subject. As such it is of particular value in investigations of the efficiency of 
different types of sampling. Its uses in this connection will be explained in 
Chapter 8. 

The above discussion has been based on the assumption that the within- 
strata variances are equal. In practice this is not likely to be exactly true, 
though the data at our disposal may be insufficient to determine the true 
variance law with any accuracy. It is therefore important to ascertain what 
is the position if a pooled estimate of variance is used when the variances are 
in fact unequal. 

If all the sampling fractions are equal we have, from equation 7.6.b, 

$$V(\bar{y}) = \frac{1 - f}{n} \cdot \frac{\Sigma(n_i s_i^2)}{n}$$

The second factor is a weighted mean of the $s_i^2$, with weights equal to $n_i$. In 
the pooled estimate of error described above $s_s^2$ is a weighted mean of the 
$s_i^2$ with weights proportional to $n_i - 1$. Unless the numbers in the different 
strata are very small, and associated in magnitude with $\sigma_i^2$, there will be little 
difference between the two estimates. Use of the pooled estimate of error 
in the estimation of the error of the population mean and total will not, 
therefore, introduce any serious disturbance when the sampling fractions are 
equal, even when the within-strata variances are very unequal. On the other 
hand, the use of a pooled estimate to determine the errors applicable to the 
mean or total of part of the population, e.g. a single stratum mean, may be very 
misleading, and it is therefore best, when there are marked differences in the 
within-strata variances, and the numbers in the different strata are not too 
small, to keep the error estimates separate, as has been done in Example 7.6.a. 
There is, of course, nothing sacrosanct in weighting by the degrees of 
freedom ; these weights merely give the most accurate estimate when the 
variances are equal, and enable the analysis of variance technique to be used. 
If the $n_i$ are too small for separate estimates of the within-strata variances 
to be of value, we can still use a pooled estimate, weighting by $n_i$ if this appears 
advisable. 

The situation is completely different with a variable sampling fraction. 
In this case equation 7.6.a shows that weights proportional to $g_i^2 n_i (1 - f_i)$ 
are required. The pooled estimate of variance may therefore be decidedly 
misleading, even with quite small differences in the within-strata variances. 
Consequently in this case separate estimates with proper weighting should 
always be used. 
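
How little the two weightings differ under a uniform sampling fraction can be illustrated with the within-strata variances of Table 7.6.a, which are very unequal. The sketch below compares the mean weighted by $n_i - 1$ (the pooled estimate) with the mean weighted by $n_i$; it is an illustration only, not a computation carried out in the text.

```python
# Numbers of farms and within-strata variances from Table 7.6.a.
n = [22, 26, 18, 26, 20, 13]
s2 = [0.0, 1.9, 11.2, 162.1, 731.5, 1947.5]

# Pooled estimate: weights n_i - 1, i.e. sums of squares over n - t.
pooled = sum((n_i - 1) * v for n_i, v in zip(n, s2)) / (sum(n) - len(n))

# Weighting by n_i, as required by equation 7.6.b.
n_weighted = sum(n_i * v for n_i, v in zip(n, s2)) / sum(n)

relative_difference = n_weighted / pooled - 1
```

Although the variances range from 0 to nearly 2,000, the two weighted means differ by less than 2 per cent.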


Example 7.7.a 

Estimate the sampling errors of the estimate of the total wheat acreage 
and number of farms growing wheat from the stratified systematic sample 
with a variable sampling fraction of Example 6.7. 

Since the systematic selection was from a list arranged by districts the 
sample will substantially be stratified by districts as well as size-groups. The 
numbers in the 6-20 and 21-50 size-groups are too small for the district 
stratification to be effective, but for the larger size-groups the within-districts 
variance is required, instead of the overall variance within the size-group. 

The available number of degrees of freedom in each district and size- 
group is small, and inspection of the values of Table 6.7.a shows that there 
are no marked differences in variability in the different districts. We may 
therefore appropriately use a pooled estimate of variance for each size-group. 

The district totals and means for the largest size-group are shown in 
Table 7.7.b. 


TABLE 7.7.b—DISTRICT TOTALS AND MEANS FOR SIZE-GROUP 501– 


District :          1       2        3         4        5     6     7       All 
No. of farms        1       4        2         9        0     1     0        17 
Total  .  .       114     487      315      1937        —    72     —      2925 
Mean   .  .       114     121.75   157.5     215.2222   —    72     —       172.0588 


The analysis of variance is given in Table 7.7.c. The sum of squares 
between districts is $114 \times 114 + 487 \times 121.75 + \dots - 2925 \times 172.0588$ 
$= 40{,}698$. The total sum of squares is $114^2 + 119^2 + 107^2 + \dots - 2925 \times$ 
$172.0588 = 72{,}067$. Subtraction gives the within-district sum of squares, 
and division by the degrees of freedom the mean squares. 
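
The between-districts sum of squares depends only on the district totals and numbers of farms, so it can be recomputed from Table 7.7.b. In the sketch below the total sum of squares, 72,067, is taken as given, since the individual farm values of Table 6.7.a are not reproduced here.

```python
# (number of farms, total acreage) for the districts with farms in the sample.
districts = {1: (1, 114), 2: (4, 487), 3: (2, 315), 4: (9, 1937), 6: (1, 72)}

n = sum(k for k, _ in districts.values())             # 17 farms
grand_total = sum(t for _, t in districts.values())   # 2925
correction = grand_total * grand_total / n            # total x general mean

# Sum of (district total x district mean), less the correction.
between_ss = sum(t * t / k for k, t in districts.values()) - correction

total_ss = 72067                                      # quoted in the text
within_ss = total_ss - between_ss
between_df = len(districts) - 1
within_df = n - len(districts)
within_ms = within_ss / within_df
```

This reproduces the sum of squares 40,698 and the within-district mean square of 2,614.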


TABLE 7.7.c—ANALYSIS OF VARIANCE BETWEEN AND WITHIN DISTRICTS OF THE 
WHEAT ACREAGES OF SIZE-GROUP 501– (DATA OF TABLE 6.7.a) 


                         Degrees of     Sum of      Mean 
                         freedom        squares     square 
Between districts  .          4         40,698      10,174 
Within districts   .         12         31,369       2,614 
Whole size-group   .         16         72,067       4,504 


The within-district mean square is substantially less than the overall mean 
square, indicating the greater similarity of farms within a district and consequent 
gain in accuracy by stratification. 

The size-groups 51–, 151–, and 301– can be analysed in the same manner. 
There is some difference in mean squares for size-group 301–, but little 
difference for the other two size-groups. For size-groups 6– and 21– the 
overall variability within size-groups can be taken. 

The remainder of the computations are set out in Table 7.7.d. They 
follow the same lines as Example 7.6.a, with the exception that the factors 
$(1 - f_i)$ are different, and that the variance of each group total must be 
multiplied by the square of the raising factor for that group, and the resultant 
variances added, in accordance with formula 7.5.a. The estimated standard 
error of the total acreage is thus $\sqrt{6{,}486{,}000} = \pm 2550$. 


TABLE 7.7.d—ESTIMATION OF SAMPLING ERRORS OF THE ESTIMATES 
OF EXAMPLE 6.7 


Size-group 
(acres)        $n_i$     $s_i^2$     $V\{S_i(y)\}$     $g_i^2$     $g_i^2 V\{S_i(y)\}$ 

   1–5            0 
   6–20           3         0               0           40,000                0 
  21–50           6        53.5           320            3,600        1,152,000 
  51–150         26       159.2         3,930              400        1,572,000 
 151–300         40       564.2        20,310              100        2,031,000 
 301–500         43     1,703          58,580               25        1,464,000 
 501–            17     2,614          29,630                9          267,000 
                135                                                   6,486,000 

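
The build-up of the variance of the total can be sketched from the columns of Table 7.7.d. The sketch assumes that each sampling fraction is the reciprocal of the raising factor (so that $g_i = 200, 60, 20, 10, 5, 3$), which is consistent with the tabulated values of $V\{S_i(y)\}$ but is an inference from the table, not a statement in the text. Because the table rounds each variance before multiplying, the unrounded total comes out slightly below 6,486,000.

```python
import math

# (n_i, s_i^2, raising factor g_i) for the size-groups 6-20 upwards.
groups = [
    (3, 0.0, 200),
    (6, 53.5, 60),
    (26, 159.2, 20),
    (40, 564.2, 10),
    (43, 1703.0, 5),
    (17, 2614.0, 3),
]

var_total = 0.0
for n_i, s2, g in groups:
    f_i = 1 / g                                   # assumed sampling fraction
    var_group_total = s2 * n_i * (1 - f_i)        # V{S_i(y)}
    var_total += g * g * var_group_total          # raised to the population

se_total = math.sqrt(var_total)
```

The resulting standard error is about 2,540 acres, against the ± 2550 obtained from the rounded entries.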

It will be noted that the acreage of wheat in the smallest size-group has 
been assumed to be zero, and that the estimated zero error variance of the 
second size-group is based on only two degrees of freedom, and is therefore 
very inaccurately determined. It is clear, however, from the nature of the 
material and the trend of the variances in the larger size-groups that this variance 
must be small. 

In the computation of the standard error of the number of farms growing 
wheat, allowance should also strictly be made for the stratification by districts. 
If the number of farms in each size-group district sub-class were large this 
could be done by calculating the variance of each size-group total of farms 
growing wheat by the method of Example 7.6.a. The numbers in many of 
the sub-classes are so small, however, that the approximation resulting from 
using the estimated proportions p to calculate the variances will be unsatisfactory. 
In this case it will be sufficient to ignore the district stratification, calculating 
the variance of each size-group total of farms growing wheat from the proportion 
in that size-group, and then proceeding in the same manner as for wheat 
acreage. The resultant standard error will be found to be ± 88.9. 


Example 7.7.b 


Estimate the sampling standard errors of the regional and varietal means 
of the yields of potatoes given in Table 5.23.a, and compare them with the 
standard errors already obtained in Example 7.5. 


As mentioned in Section 5-23, the sample can be regarded as stratified 
by regions (but not by varieties). The regional standard errors are therefore 
derived from the analysis of variance within and between regions. This is 
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given by lines (1), (4) and (5) of Table 7.7.e. The required standard errors 


are therefore 4/(5-25/174) = + 0-17, etc. 
TABLE 7.7.e. POTATO SURVEY: ANALYSIS OF VARIANCE OF YIELDS PER ACRE

                                    Degrees of     Sum of       Mean
                                    freedom        squares      square
Between regions             (1)         4            173.7      43.42
Within regions:
    Between varieties       (2)        16            987.7      61.73
    Within varieties        (3)       880          3,713.6       4.22
    Total                   (4)       896          4,701.3       5.25
Total                       (5)       900          4,875.0       5.42

Between varieties           (6)         4            887.3     221.82
Within varieties:
    Between regions         (7)        16            274.1      17.13
    Within regions          (8)       880          3,713.6       4.22
    Total                   (9)       896          3,987.7       4.45
Total                      (10)       900          4,875.0       5.42


The mean square within regions, 5.25, is 1.24 times the mean square
within regions and varieties, 4.22, already given in Example 7.5. This latter
mean square is obtained from an analysis of variance within and between
the regional-varietal groups, lines (2) and (3). This would have been the
appropriate mean square for estimating the errors of the regional means if
the sample had been stratified by regions and varieties.

The exact standard errors of the varietal means cannot be obtained by any 
simple process. If the sample were fully random, and not stratified by regions, 
the correct estimate would be that given by the within-varieties component 
of variance, i.e. by treating the sample as if it were stratified by varieties.
Stratification by regions will reduce the sampling error of the varietal means 
slightly, but not to any great extent. Consequently the estimate obtained by 
stratifying by varieties and not by regions will be somewhat of an overestimate 
of the true standard error. 

The analysis of variance within and between varieties (ignoring regions)
is given by lines (6), (9) and (10) of Table 7.7.e. Approximations to the
varietal standard errors are therefore given by $\sqrt{4.45/393} = \pm 0.11$, etc.
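
As a quick check on these figures, the two standard errors can be recomputed from the sums of squares of Table 7.7.e. This is only a sketch: the function name is hypothetical, and 174 and 393 are taken from the text as the numbers of fields entering the first regional and first varietal mean.

```python
import math

def se_of_mean(sum_sq_within, df_within, n_in_mean):
    """Standard error of a mean from a pooled within-group mean square."""
    mean_square = sum_sq_within / df_within
    return math.sqrt(mean_square / n_in_mean)

# Line (4): within regions; line (9): within varieties
se_regional = se_of_mean(4701.3, 896, 174)
se_varietal = se_of_mean(3987.7, 896, 393)
```

The only difference between the two calculations is which classification is removed from the total sum of squares before the mean square is formed.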

It should be noted that although the sample is stratified by regions the
component of variance due to regions must not be eliminated when calculating
the standard errors of the varietal means. This must only be done if the sample
is stratified by both varieties and regions. The reason is as follows. If the
sample is stratified by both varieties and regions the proportions of fields from
each region in each varietal mean will be exactly equal to the proportions for
that variety in the country. Hence only variation between fields of the same
variety within each region contributes to the error. In the present case,
however, the proportions of fields from each region in each varietal mean do
not correspond exactly to the proportions in the country. The deviations are
in fact only slightly less than would be obtained in a random sample.

Those familiar with the use of the analysis of variance in replicated
experiments may wonder why two separate analyses are required, instead of
the single analysis analogous to the partition of the degrees of freedom of a
complete 5 × 5 table into

Regions                 4
Varieties               4
Regions × varieties    16


The reason is that regions and varieties are not orthogonal, owing to the differing 
numbers of fields in the different sub-classes. The analysis of variance of 
non-orthogonal material is inherently more complicated, and in particular the 
interaction component, regions × varieties, can only be obtained by rather
elaborate calculation (Yates, 1934, A). It is not given by the subtraction of 
the sums of squares for regions (1) and varieties (6) from the sum of squares 
for all regional-varietal sub-classes, (1) and (2), or (6) and (7). 

All the components of variance due to the regional-varietal classification
must be eliminated when calculating the sampling standard errors of varietal
differences freed from regional effects, or of regional differences freed from
varietal effects, since we are then concerned only with the component of
variance within sub-classes. These standard errors have already been discussed
in Example 7.5. The within-sub-classes component can be obtained by splitting
the sum of squares within each region into between and within varieties and
pooling the components so obtained, as in lines (2) and (3) of Table 7.7.e;
by doing the same for regions within varieties, as in lines (7) and (8); or by
making a direct analysis between and within regional-varietal sub-classes.
All three processes are equivalent arithmetically, and give the same sum of
squares (880 degrees of freedom) within sub-classes.

The reader should calculate for himself the various sums of squares (other
than the total sum of squares) given in Table 7.7.e. These can be obtained
from the data of Table 5.23.b. For this purpose the means require to be
recalculated to a greater number of decimal places than those given in
Table 5.23.b.*

* There are one or two minor last-place discrepancies between the means and
totals given in Table 5.23.b. These are due to reconstruction of the data
from a table in a report.

It will now be seen that three separate variances are relevant to the
calculation of the standard errors appropriate to the various estimates we have
obtained from the data of Table 5.23.b. There is also a fundamental difference
in the nature of the errors. Those appropriate to the regional means and
varietal means are genuine sampling errors. They answer the question:
given the existing distribution of varieties in the different regions, by how
much may the sample regional (or varietal) means be expected to deviate from
the corresponding means over all fields? If the sampling fraction were large
the correction for finite sampling would require to be made.

In evaluating the errors of the estimates of the regional and varietal 
differences freed from the effects of the other factor, we are concerned with the 
more general question: how far are the estimated differences likely to be in 
error due to chance variations between the fields on which the varieties are 
grown? This is not a sampling error. It would still exist if the data collected 
represented all the potato fields in the country. The correction for finite 
sampling should therefore not be applied. 

It should be noted also that this latter estimate of error is based on the 
assumption that the distribution of the varieties over the different fields within 
a region is random, and that conditions of growth, etc., are also randomly 
distributed. This may be far from the truth. If variety P is regarded, 
rightly or wrongly, as particularly suitable for poor soils, it will tend to be grown 
on poor soils, and its yield will be less for this reason. A new variety will tend 
to be grown by the more progressive farmers, and may in consequence give 
higher yields, even though no better than the older varieties. Consequently 
the estimates of error merely provide lower limits to the real errors; in other
words they represent the errors attributable to the residual random component 
of variance only. Consequently, as already emphasized in Section 5.23, all 
conclusions must be tentative. Only experiments can give definite answers. 


7.8 Ratio method: random sample 


In order to calculate the variance of $\bar{r}$ the correlation between the values
of x and y for the same sampling unit must be taken into account. From
formula 7.5.k, with the additional covariance term and allowance for finite
population, we obtain for a random sample

$$V(\bar{r}) = \frac{1-f}{n}\,\bar{r}^2\left\{\frac{V(y)}{\bar{y}^2} - \frac{2\,\mathrm{cov}(xy)}{\bar{x}\bar{y}} + \frac{V(x)}{\bar{x}^2}\right\}$$



This is an approximate formula, but is accurate enough for practical purposes 
in the cases met with in sampling. 

The formula can be put in an alternative form, which somewhat simplifies 
the approach to more complicated cases such as stratified samples. If we 
denote by Q the sum of the squares of the deviations of the y’s from the values 
given by the ratio line (OMD of Fig. 6.8), we have 

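$$\begin{aligned}
Q &= S(y - \bar{r}x)^2 \\
  &= S\{(y - \bar{y}) - \bar{r}(x - \bar{x})\}^2 \\
  &= S(y^2) - 2\bar{r}\,S(xy) + \bar{r}^2 S(x^2) \\
  &= S(y - \bar{y})^2 - 2\bar{r}\,S(x - \bar{x})(y - \bar{y}) + \bar{r}^2 S(x - \bar{x})^2
\end{aligned}$$

the last two expressions being those which are suitable for computation.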
If we now take $s_Q^2$ to represent the estimated mean square deviation from
the true ratio line, we have

$$s_Q^2 = \frac{Q}{n-1}$$

We then find

$$V(\bar{r}) = \frac{(1-f)\,n\,s_Q^2}{\{S(x)\}^2}, \qquad V(Y) = X^2\,V(\bar{r})$$

The first of the above formulae is equivalent to formula 7.5.g.

The analogy with the case of a random sample without supplementary
information can now be seen. Apart from the factor (population mean of x)²/(sample
mean of x)², which will be approximately unity, the only difference is that the
sum of the squares of the deviations of y from the mean $\bar{y}$ of the sample is
replaced by the sum of the squares of the deviations of y from the ratio line of
the sample.
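
The computation can be sketched in code. This is a minimal illustration of the working carried out in Examples 7.8.a and 7.8.b below, assuming the divisor n - 1 for Q and the form S.E.($\bar{r}$) = √{(1 - f) n s_Q²}/S(x) that those examples use. The function name and argument names are hypothetical.

```python
import math

def ratio_se(n, f, sum_x, sum_y, ss_y, sp_xy, ss_x):
    """Ratio estimate and its standard error for a random sample.

    ss_y, sp_xy, ss_x are the sums of squares and products of deviations
    of y and x from their sample means.
    """
    r = sum_y / sum_x
    q = ss_y - 2.0 * r * sp_xy + r * r * ss_x   # deviations from the ratio line
    s2_q = q / (n - 1)
    se_r = math.sqrt((1.0 - f) * n * s2_q) / sum_x
    return r, q, s2_q, se_r

# Sums taken from Example 7.8.b (43 kraals out of 325)
r, q, s2_q, se_r = ratio_se(43, 43 / 325, 3427, 799, 7218.5, 13286.5, 55199.0)
```

Multiplying `se_r` by the known population total of x gives the standard error of the ratio estimate of the total of y.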

The variance of a standardized estimate $Y_0$ will be obtained by replacing
x by $x_0$ in the above formula.

In the case of two-phase sampling the first-phase estimate $X_1$ of X will
be subject to sampling errors. If the sampling is random for both phases the
variance of $\bar{y}$ will be

$$V(\bar{y}) = \frac{1}{n_2}\left(1 - \frac{n_2}{n_1}\right)s_Q^2 + \frac{1 - f_1}{n_1}\,s^2$$

where the suffices refer to the phases. $s_Q^2$ is calculated as above from the
second-phase units, and $s^2$ is the total variance of y, also calculated from the
second-phase units.

The first term of the above formula will be recognized as the sampling
variance of $\bar{y}$ due to the second-phase sampling of the first-phase sample
(regarded as without error), while the second term is the first-phase sampling
variance of $\bar{y}$, i.e. the variance which would be obtained if y were determined
for all units of the first-phase sample. This subdivision of the variance provides
a general method of obtaining the errors of two-phase sampling. In certain
types of sampling the circumstance that values of y are not available for all
the first-phase units introduces complications into the estimation of the second
component of variance which will be dealt with in Section 8.7.

It frequently happens that the available supplementary information is out
of date or otherwise subject to error. If, however, values of x are known for
all units of the population these values can be used in the calculation of $\bar{r}$ from
the selected units. If this is done bias will be avoided, and the effect of these
errors on the final estimates will be correctly assessed, provided the original
frame is complete. If the frame is not complete the population divides into two
parts: that covering units included in the original frame, for which the ratio
or regression method of estimation can be used; and that covering units not
so included, for which the appropriate method of estimation without
supplementary information will be required.

It may also be noted that if the variance $V_x(r)$ of r for fixed x is constant
over the whole range of values of x, and if r itself exhibits no trend over this
range, $V(\bar{r})$ may be estimated from formulae 7.5.h and 7.5.e, substituting
x for w and r for y. This method of estimation has the advantage of saving
computation in cases in which the values of r are directly available while those
of y are not, but it will give a biased estimate of error if the above conditions
are not fulfilled. An example of the method is given for a stratified sample
in Example 7.17.

If $V_x(r)$ is virtually constant, the sampling error of any ratio estimate can
be rapidly calculated once the value of $V_x(r)$ has been established, since only
$S(x)$ and $S(x^2)$ require to be known, formula 7.5.e being used. Similarly,
if $V_x(r)$ is inversely proportional to x, i.e. equal to $\lambda/x$, formula 7.5.f can be
used, only $S(x)$ being required. The effective constancy of $\lambda$, and its value,
can be most simply established by calculating $V(\bar{r})$ in the ordinary manner
for various batches of data and calculating the resultant values of $\lambda$ from
formula 7.5.f. This is in general preferable to using formulae 7.5.i and
7.5.f directly.


Example 7.8.a

Estimate the sampling error of the ratio estimate of the acreage of wheat
from the random sample of farms (Example 6.9.a).

We have

$$S(y^2) = 207{,}261 \qquad S(xy) = 902{,}958 \qquad S(x^2) = 5{,}061{,}734$$
$$\bar{r} = 0.1522430 \qquad 2\bar{r} = 0.3044860 \qquad \bar{r}^2 = 0.0231779$$
$$Q = 49{,}643 \qquad s_Q^2 = 400.35$$
$$\text{S.E.}(\bar{r}) = \frac{1}{S(x)}\sqrt{(1 - 1/20) \times 125 \times 400.35} = \pm 0.01443$$
$$\text{S.E.}(Y) = 273{,}074 \times 0.01443 = \pm 3{,}940$$


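Example 7.8.b

Estimate the sampling error of the estimates of Example 6.9.b.

We have

$$n = 43 \qquad f = 43/325 = 0.1323 \qquad g = 7.5581$$
$$S(y) = 799 \qquad \bar{y} = 18.5814 \qquad \bar{x} = 79.6977 \qquad S(x) = 3{,}427$$
$$S(y^2) = 22{,}065 \qquad S(xy) = 76{,}965 \qquad S(x^2) = 328{,}323$$
$$\bar{y}\,S(y) = 14{,}846.5 \qquad \bar{y}\,S(x) = 63{,}678.5 \qquad \bar{x}\,S(x) = 273{,}124.0$$
$$S(y - \bar{y})^2 = 7{,}218.5 \qquad S(x - \bar{x})(y - \bar{y}) = 13{,}286.5 \qquad S(x - \bar{x})^2 = 55{,}199.0$$
$$\bar{r} = 0.233149 \qquad 2\bar{r} = 0.466298 \qquad \bar{r}^2 = 0.0543585$$
$$Q = 4{,}023.6 \qquad s_Q^2 = 95.80$$
$$\text{S.E.}(100\,\bar{r}) = \frac{100}{3{,}427}\sqrt{0.8677 \times 43 \times 95.80} = \pm 1.744$$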
Had the sample been a random sample of individuals the formula of
Section 7.4 would have been applicable, giving

$$\text{S.E.}(100\,\bar{r}) = 100\sqrt{0.8677 \times 0.2331 \times 0.7669/3{,}427} = \pm 0.673$$

The large difference between these two standard errors is an indication of the
additional variability between kraals, and illustrates the misleading results
that may be obtained by using the formula of Section 7.4 when the sampling
units consist of groups of individuals and not single individuals.

Since the total number of persons in the reserve is unknown, the standard
errors of the total numbers are derived from the formulae appropriate to a
random sample without supplementary information. We therefore have

$$\frac{S(y - \bar{y})^2}{n - 1} = 171.87 \qquad \frac{S(x - \bar{x})^2}{n - 1} = 1{,}314.3$$
$$\text{S.E.}(X) = 7.5581 \times \sqrt{0.8677 \times 43 \times 1{,}314.3} = \pm 1{,}673$$
$$\text{S.E.}(Y) = 7.5581 \times \sqrt{0.8677 \times 43 \times 171.87} = \pm 605.3$$


The standard error of the number present in the reserve can be calculated
in the same manner from the sum of the squares of the deviations of (x − y),
which in turn can be calculated directly from the separate values of (x − y).
In the present case, where the separate values of (x − y) are not tabulated,
and where $S(x - \bar{x})(y - \bar{y})$ has already been calculated, it is more convenient
to obtain the required sum of squares of deviations from the sums of squares
and products already calculated (see Section 7.5). Thus

$$S\{(x - y) - (\bar{x} - \bar{y})\}^2 = 55{,}199.0 + 7{,}218.5 - 2 \times 13{,}286.5 = 35{,}844.5$$
$$\text{S.E.}(X - Y) = 7.5581 \times \sqrt{0.8677 \times 43 \times 35{,}844.5/42} = \pm 1{,}349$$


Note that x and y are not independent, being derived from the same
sampling units, and therefore V(X − Y) is not equal to V(X) + V(Y), but
to V(X) + V(Y) − 2 cov(XY). Putting cov(XY) equal to 13,286.5/42 gives
the same result as above.
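
The equivalence of the two routes can be checked numerically. The sketch below uses the sums of Example 7.8.b and assumes the usual form g√{(1 - f) n s²} for the standard error of a raised total from a random sample; the helper name is hypothetical.

```python
import math

n, g, f = 43, 325 / 43, 43 / 325
ss_x, ss_y, sp_xy = 55199.0, 7218.5, 13286.5   # sums of squares and products of deviations

def var_of_total(ss):
    """Variance of a raised total, given a sum of squares (or products) of deviations."""
    return g * g * (1 - f) * n * ss / (n - 1)

# Route 1: work directly with the deviations of (x - y)
se_direct = math.sqrt(var_of_total(ss_x + ss_y - 2 * sp_xy))

# Route 2: V(X) + V(Y) - 2 cov(XY)
se_identity = math.sqrt(var_of_total(ss_x) + var_of_total(ss_y)
                        - 2 * var_of_total(sp_xy))
```

Omitting the covariance term would give √{V(X) + V(Y)}, which overstates the error here because x and y are positively correlated.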


7.9 Ratio method: stratified sample with uniform sampling fraction

(a) When the ratio is assumed to be the same for all strata:

Instead of taking the sum of squares of deviations from the general ratio
line, the deviations from a series of lines parallel to this line and passing through
the points representing the strata means must be taken, the divisor n − 1
being replaced by n − t. Thus

$$\begin{aligned}
Q &= \Sigma\,S_i\{(y - \bar{y}_i) - \bar{r}(x - \bar{x}_i)\}^2 \\
  &= \Sigma\,S_i(y - \bar{y}_i)^2 - 2\bar{r}\,\Sigma\,S_i(y - \bar{y}_i)(x - \bar{x}_i) + \bar{r}^2\,\Sigma\,S_i(x - \bar{x}_i)^2
\end{aligned}$$
$$s_Q^2 = Q/(n - t)$$

The sums of squares and products will be recognized as the sums of squares
and products within strata, similar to those already obtained for the y variate
in the pooled estimate of error for a stratified sample without supplementary
information.
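
A minimal sketch of case (a) follows, assuming the data are held as one list of (x, y) pairs per stratum and that, with a uniform sampling fraction, the common ratio is the ratio of the sample totals. The function names and the data layout are hypothetical.

```python
def pooled_ratio_q(strata):
    """Q and s_Q^2 for a common ratio, using within-strata sums of squares and products.

    strata: list of strata, each a list of (x, y) pairs.
    """
    sum_x = sum(x for s in strata for x, _ in s)
    sum_y = sum(y for s in strata for _, y in s)
    r = sum_y / sum_x
    ss_y = sp_xy = ss_x = 0.0
    for s in strata:
        mx = sum(x for x, _ in s) / len(s)
        my = sum(y for _, y in s) / len(s)
        ss_y += sum((y - my) ** 2 for _, y in s)
        sp_xy += sum((x - mx) * (y - my) for x, y in s)
        ss_x += sum((x - mx) ** 2 for x, _ in s)
    q = ss_y - 2 * r * sp_xy + r * r * ss_x
    n, t = sum(len(s) for s in strata), len(strata)
    return r, q, q / (n - t)

def direct_q(strata, r):
    """Same Q from the unit deviations about lines through the strata means."""
    total = 0.0
    for s in strata:
        mx = sum(x for x, _ in s) / len(s)
        my = sum(y for _, y in s) / len(s)
        total += sum(((y - my) - r * (x - mx)) ** 2 for x, y in s)
    return total

# Made-up illustrative data: two strata
demo = [[(10, 2), (20, 5), (30, 6)], [(40, 9), (50, 14)]]
r_demo, q_demo, s2_demo = pooled_ratio_q(demo)
```

The second function exists only to confirm that the expanded form agrees with the defining sum of squared deviations.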


(b) When the ratio is permitted to assume different values for the different
strata:

The common $\bar{r}$ is replaced by $\bar{r}_i$ corresponding to the different strata,
so that

$$Q = \Sigma\,S_i(y - \bar{y}_i)^2 - 2\,\Sigma\,\bar{r}_i\,S_i(y - \bar{y}_i)(x - \bar{x}_i) + \Sigma\,\bar{r}_i^2\,S_i(x - \bar{x}_i)^2$$

The divisor n − t stands.


In this case the contribution to Q from each stratum is best computed 
separately. If desired the variances of the contributions to Y from the different 
strata may also be estimated separately. This course is equivalent to assigning 
slightly different weights to the different contributions to Q, the situation 
being analogous to that already discussed in Section 7.6. 

Note that if the population totals $X_i$ are not known for the different strata
but the total X for the whole population is known, the formula for case (a)
must be used for calculating the sampling errors of $\bar{r}$ and Y, even if the ratio
clearly varies from stratum to stratum, since the method of estimation must
be that corresponding to case (a).


Example 7.9

Estimate the sampling errors of the estimate of Example 6.10.

The contributions to Q from the six districts are:

District       Q_i           District       Q_i
   1         5,107.59           4        20,566.56
   2         1,550.71           5         3,737.14
   3         7,963.98           6         1,080.92

                                Total    40,006.90

Hence

$$\text{S.E.}(Y) = \frac{273{,}074}{S(x)}\sqrt{\left(1 - \frac{1}{20}\right) \times 125 \times \frac{40{,}006.90}{119}} = \pm 3{,}610$$


7.10 Ratio method: stratified sample with variable sampling fraction

If the ratio is assigned different values in the different strata the variance
of the estimated total is

$$V(Y) = \Sigma\,\frac{X_i^2}{\{S_i(x)\}^2}\,(1 - f_i)\,n_i\,s_{Qi}^2 = \Sigma\,\{g_i^2\,(1 - f_i)\,n_i\,s_{Qi}^2\} \text{ approximately,}$$
$$V(\bar{r}) = V(Y)/X^2$$

Here the $s_{Qi}^2$ are estimated separately for each stratum, using the value of the
ratio appropriate to the stratum and the divisor $n_i - 1$ for $Q_i$.

If the ratio is assumed to be the same for all strata the same formula may
be used, with the exception that the $Q_i$ are calculated using the general ratio,
with divisors $n_i - 1$ as before.

For an illustration of the application of these formulae see Example 7.17.


7.11 Ratio method: integral values of the supplementary variate 


When the supplementary variate x can only assume small integral values
the above formulae for Q can be simplified by classifying the data according
to the value of x. The most common instance in censuses and surveys is in
surveys of human populations in which the sampling units are households and
information is required on individuals.

In the analysis of data appertaining to individuals the working unit in
the analysis will commonly be the individual, although the sampling unit is
the household. Clear distinction must therefore be made between the values y
for the sampling units, which in households of two, for instance, will consist
of the totals of pairs of individuals, and the values for the individuals. These
latter values we will denote by z, with the convention that [z] for families
of more than one unit represents the total of the individuals in this family,
so that [z] equals y. With this notation $\bar{r} = \bar{z}$. Suffices will be used to indicate
size of family; $n_1, n_2, \ldots$ to denote the numbers of families of the different
sizes.

No difficulty should be found in transforming the formulae for Q into
a form suitable for computation. In the case of a random sample, for instance,
we find

$$\begin{aligned}
Q &= S_1(y^2) + S_2(y^2) + \ldots - 2\bar{r}\,\{S_1(y) + 2S_2(y) + \ldots\} + \bar{r}^2\,(n_1 + 4n_2 + \ldots) \\
  &= S_1(z^2) + S_2[z]^2 + \ldots - 2\bar{z}\,\{S_1(z) + 2S_2[z] + \ldots\} + \bar{z}^2\,(n_1 + 4n_2 + \ldots) \\
  &= S_1(z - \bar{z}_1)^2 + S_2([z] - 2\bar{z}_2)^2 + \ldots + n_1(\bar{z}_1 - \bar{z})^2 + 4n_2(\bar{z}_2 - \bar{z})^2 + \ldots
\end{aligned}$$

It will be noted that in order to calculate Q the quantities [z] are required.
In the event of the survey material being recorded on punched cards, each 
individual will normally be assigned to a separate card. The required totals 
can then be obtained on the tabulator by sorting for family designation, family 
size, and any stratification which is required, and controlling on family 
designation, either printing the totals or reproducing them directly on to new 
family cards. The latter procedure will be advantageous if further analysis 
is required on the characteristics of families regarded as entities. 

This type of analysis can be confined to a special type of individual, such
as adult males. If punched cards are used a card count will have to be
introduced to count the number of the special type occurring in each family,
which may be termed the "partial size" of the family. There is then the
minor complication that families of different partial sizes will occur together
in the tabulated results. This is overcome if the results are reproduced on
cards, as the "partial sizes" can be punched on the cards, and the cards
subsequently sorted by partial size and listed.

The last of the above forms for Q separates the various components of 
variance. The first term gives the contribution to Q arising from variation 
between families of one, the second that between families of two, etc., and the 
first term of the second set gives the contribution due to the average deviation 
of families of one from the general mean, etc. If the sample were stratified
for size of family the second set of terms would be omitted. They will also
be omitted if the error of a mean standardized for distribution of family size
is required.
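
The separation of Q into the two sets of terms can be sketched as follows. This is an illustration only: the function name and the layout (one list of individual values z per family) are hypothetical, and the data in the usage line are made up.

```python
from collections import defaultdict

def q_by_family_size(families):
    """Split Q into within-size and between-size components.

    families: list of families, each a list of individual values z.
    Returns (within, between, direct), where direct is Q computed from
    the family totals and the general mean per individual.
    """
    total_z = sum(sum(fam) for fam in families)
    total_persons = sum(len(fam) for fam in families)
    z_bar = total_z / total_persons               # the ratio r-bar equals z-bar
    by_size = defaultdict(list)
    for fam in families:
        by_size[len(fam)].append(sum(fam))        # [z], the family total y
    within = between = 0.0
    for k, totals in by_size.items():
        n_k = len(totals)
        z_bar_k = sum(totals) / (k * n_k)         # mean per individual, size k
        within += sum((y - k * z_bar_k) ** 2 for y in totals)
        between += k * k * n_k * (z_bar_k - z_bar) ** 2
    direct = sum((sum(fam) - len(fam) * z_bar) ** 2 for fam in families)
    return within, between, direct

within, between, direct = q_by_family_size([[1], [3], [2, 4], [5, 1], [0, 2], [2, 2, 5]])
```

Dropping `between` corresponds to the case of a sample stratified by size of family, or of a mean standardized for the distribution of family size.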


The above formula for Q can be set out in analysis of variance form, as
in Table 7.11.

TABLE 7.11. ANALYSIS OF VARIANCE FORM FOR Q FOR INTEGRAL VALUES OF x

                                          Degrees
                                          of freedom    Sum of squares
Between families of size 1                n_1 − 1       S_1(z − z̄_1)²
   "        "      "  "  2                n_2 − 1       S_2([z] − 2z̄_2)²
   "        "      "  "  3                n_3 − 1       S_3([z] − 3z̄_3)²
   ...
Between means of families of different
  sizes                                   t − 1         n_1(z̄_1 − z̄)² + 4n_2(z̄_2 − z̄)² + ...

The mean square of the first line then gives the estimate $s_1^2$ of the error
variance of families of size 1, the mean square of the second line, divided
by 4, the estimate of the error variance of family means of size 2, etc. The
means of families of a given size can thus be compared for different parts of
the population, remembering that the further divisors for $s_1^2$, etc., depend
on the numbers of families and not individuals entering into the means.


7.12 Regression method: random sample 


The estimation of error in the regression method follows much the same
lines as in the ratio method. The sum of squares of deviations from the ratio
line is replaced by the sum of squares of deviations from the regression line
and the divisor n − 2 is used instead of n − 1, since an additional degree
of freedom is accounted for by the fact that the regression line not only passes
through the mean point, but has its slope determined independently from the
data.

The sum of the squares of the deviations from the regression line is given
by the equation

$$\begin{aligned}
Q &= S\{(y - \bar{y}) - b(x - \bar{x})\}^2 \\
  &= S(y - \bar{y})^2 - b\,S(x - \bar{x})(y - \bar{y}) \\
  &= S(y - \bar{y})^2 - \frac{\{S(x - \bar{x})(y - \bar{y})\}^2}{S(x - \bar{x})^2}
\end{aligned}$$

so that

$$s^2 = \frac{Q}{n - 2}$$

and, if errors in b are neglected,

$$V(\bar{y}) = \frac{1 - f}{n}\,s^2$$

The error variance of b, if the variance of y for fixed x is constant, is

$$V(b) = \frac{s^2}{S(x - \bar{x})^2}$$

so that if the regression is truly linear the error variance of a standardized
value $y_0$ of y for the value $x_0$ of x is

$$V(y_0) = \left\{\frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S(x - \bar{x})^2}\right\}s^2$$

The correction for finite population is here omitted since standardized values
are ordinarily used for comparative purposes.

Allowance for errors in b can be made in $V(\bar{y})$ in the same way, but such
errors will always be small relative to the other component of error. They
will on the average increase the error variance approximately in the ratio
n/(n − 1).
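
These formulae translate directly into a short routine. The sketch below is an illustration under the stated assumptions (random sample, errors in b neglected in the variance of the mean); the function names are hypothetical, and the figures in the usage line are the sums of squares and products quoted in Example 7.12.a.

```python
import math

def regression_errors(n, f, ss_y, sp_xy, ss_x):
    """Regression coefficient, Q, s^2, S.E. of the adjusted mean and S.E. of b."""
    b = sp_xy / ss_x
    q = ss_y - b * sp_xy            # sum of squares about the regression line
    s2 = q / (n - 2)
    se_mean = math.sqrt((1 - f) * s2 / n)
    se_b = math.sqrt(s2 / ss_x)
    return b, q, s2, se_mean, se_b

def var_standardized(s2, n, x0, x_mean, ss_x):
    """Error variance of a value standardized to x0 (no finite-population correction)."""
    return s2 * (1.0 / n + (x0 - x_mean) ** 2 / ss_x)

b, q, s2, se_mean, se_b = regression_errors(125, 1 / 20, 164904, 624739, 3234970)
```

Because b is recomputed here from the rounded sums, Q and s² differ from the printed values in the later decimal places, but the standard errors agree to the accuracy quoted.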

In the case of two-phase sampling the sampling variance of $\bar{x}_1$ will introduce
the additional term $b^2\,V(\bar{x}_1)$ into the above expression for $V(\bar{y})$. The
general approach is given in Section 8.7.

If an arbitrary value $b_0$ of the regression coefficient is used, instead of the
value b calculated from the data, the formula for the sum of the squares of the
deviations from the arbitrary regression line becomes

$$Q = S(y - \bar{y})^2 - 2b_0\,S(x - \bar{x})(y - \bar{y}) + b_0^2\,S(x - \bar{x})^2$$
$$s^2 = Q/(n - 1)$$

This procedure is equivalent to the analysis of the values $y - b_0 x$ from
the individual sampling units. Use of the above expression for Q saves the
trouble of calculating the values of $y - b_0 x$ for the individual sampling units,
at the expense of calculating the sums of squares and products of x and y
instead of the sums of squares of $y - b_0 x$.

No allowance has to be made for errors in $b_0$, but an arbitrary value $b_0$
should not be used for standardization unless it is known that $b_0$ approximates
closely to b, so that the error $(b_0 - b)(x_0 - \bar{x})$ introduced into the
standardization correction is small.
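
The equivalence between the expanded form of Q and the direct analysis of the unit values can be sketched as below. The function names are hypothetical and the small data set is made up; the three sums in the final lines are those quoted in Example 7.12.b.

```python
def q_arbitrary_slope(ss_y, sp_xy, ss_x, b0):
    """Q about a line of fixed slope b0, from sums of squares and products."""
    return ss_y - 2 * b0 * sp_xy + b0 * b0 * ss_x

def q_from_units(xs, ys, b0):
    """The same Q from the unit values y - b0 * x."""
    d = [y - b0 * x for x, y in zip(xs, ys)]
    m = sum(d) / len(d)
    return sum((v - m) ** 2 for v in d)

def sums_of_deviations(xs, ys):
    """Sums of squares and products of deviations from the means."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    ss_y = sum((y - my) ** 2 for y in ys)
    sp_xy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    ss_x = sum((x - mx) ** 2 for x in xs)
    return ss_y, sp_xy, ss_x

# Made-up illustrative data
xs, ys = [3.0, 5.0, 8.0, 12.0], [4.0, 4.5, 9.0, 11.0]
q_sums = q_arbitrary_slope(*sums_of_deviations(xs, ys), b0=0.8)
q_units = q_from_units(xs, ys, 0.8)

# Sums from Example 7.12.b (25 sample plots)
q_b1 = q_arbitrary_slope(115266, 52069, 82296, 1.0)
s2_b055 = q_arbitrary_slope(115266, 52069, 82296, 0.55) / 24
```

Since no slope is estimated from the data, the divisor stays at n - 1, which is 24 for the 25 plots of the example.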


Example 7.12.a

Estimate the errors of the estimates of Example 6.12.a.

We have

$$Q = 164{,}904 - 0.19316 \times 624{,}739 = 44{,}229$$
$$s^2 = Q/123 = 359.59$$
$$V(\bar{y}) = (1 - 1/20) \times 359.59/125 = 2.733 = (\pm 1.653)^2$$
$$\text{S.E.}(Y) = 2{,}496 \times 1.653 = \pm 4{,}126$$
$$V(b) = \frac{359.59}{3{,}234{,}970} = 0.0001112 = (\pm 0.01054)^2$$


Example 7.12.b 


Estimate the error of the estimates of total volume of timber of Example 
6.12.b. 


Except for the estimate derived from the arbitrary value of the regression
coefficient $b_0 = 1$ the computations follow the lines already given and are
left as an exercise for the reader.

When $b_0 = 1$ we have

$$Q = 115{,}266 - 2 \times 52{,}069 + 82{,}296 = 93{,}424$$
$$s^2 = 93{,}424/24 = 3{,}893$$


The values of the error variance per unit, and the resultant standard errors
of the various estimates, are as follows:

                            Variance                              Relative
                            per unit    S.E. (total volume)       efficiency
Sample plots only            4,803      ± 710,000 cu. ft.             72
Ratio method                 4,230      ± 602,000 cu. ft.             82
Regression, b_0 = 1          3,893      ± 639,000 cu. ft.             89
Regression, b = 0.6327       3,579      ± 613,000 cu. ft.             96
Regression, b_0 = 0.55       3,454      ± 603,000 cu. ft.            100

The relative efficiency of the various methods of estimation is inversely
proportional to the value of the variance per unit. Setting the last value at
100 we obtain the relative efficiencies of the last column. The relative
efficiencies fall in the order given. If the information from the sample plots
only is used, neglecting the eye estimates, about 40 per cent. more sample
plots will be required to give results of the same accuracy as those obtained
by using a regression of 0.55 on the eye estimates.
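
The relative efficiencies can be recomputed from the variances per unit. This is a small sketch; the dictionary keys are hypothetical labels for the five methods, and the printed percentages may differ by a unit in the last place because the tabulated variances are themselves rounded.

```python
variance_per_unit = {
    "sample plots only": 4803,
    "ratio method": 4230,
    "regression, b0 = 1": 3893,
    "regression, b = 0.6327": 3579,
    "regression, b0 = 0.55": 3454,
}

reference = variance_per_unit["regression, b0 = 0.55"]

# Efficiency is inversely proportional to the variance per unit
efficiency = {name: 100 * reference / v for name, v in variance_per_unit.items()}

# Proportionate increase in sample plots needed if the eye estimates are ignored
extra_plots = variance_per_unit["sample plots only"] / reference - 1
```

The same ratio of variances gives the additional sample size needed by a less efficient method to reach equal accuracy.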

The values for $b_0 = 0.55$ have been included to illustrate the fact that
any value of the regression coefficient near to the value derived from the data
will give results which are of about the same accuracy. Here there is an
apparent small gain in accuracy owing to change in the degrees of freedom
from 23 to 24. There is therefore no point in attempting to take account of
small differences in the regression coefficient for different parts of the data,
or to determine b very exactly. Any value reasonably near the correct value
will give a satisfactory adjustment.

The ratio method has given an estimate of the standard error which is 
relatively low because S (x) for the sample is high. The average performance 
of the ratio method may best be judged by the variance per unit. 

Limits of error can be assigned to the possible bias in the eye estimates
by calculating the standard error of the difference $\bar{x} - \bar{y}$, or of the ratio $\bar{r}$.
We have

$$\text{S.E.}(\bar{x} - \bar{y}) = \sqrt{3{,}893/25} = \pm 12.5$$

The actual difference, −15.3, is 1.22 times its standard error, and these
data therefore do not by themselves furnish conclusive evidence of the
existence of bias in the eye estimates. Taking limits of error of ± twice the
standard error gives limits to the bias of −40.3 and +9.7, i.e. −27 per
cent. and +7 per cent.

Similarly S.E.($\bar{r}$) = ± 0.098, and since $\bar{r}$ = 1.116 the ratio of the deviation
of $\bar{r}$ from unity to its standard error is 1.19, which compares with the value
of 1.22 for $\bar{x} - \bar{y}$.

As mentioned in Example 6.12.b, the more extensive data of the full 
survey confirmed the existence of bias, giving at the same time a more accurate 
determination of its average magnitude and variation for different types of 
woodland.


7.13 Regression method: stratified and balanced samples

(a) Uniform sampling fraction, regression coefficient the same for all
strata:

As for a random sample, except that

$$Q = \Sigma\,S_i(y - \bar{y}_i)^2 - b\,\Sigma\,S_i(y - \bar{y}_i)(x - \bar{x}_i)$$
$$s^2 = Q/(n - t - 1)$$
$$V(b) = s^2/\Sigma\,S_i(x - \bar{x}_i)^2$$

If an arbitrary value $b_0$ is taken, the formula for Q must be rewritten in
the same manner as in an unstratified sample, the divisor being n − t.

(b) Uniform sampling fraction, regression coefficients different for the
different strata:

$$b_i = S_i(y - \bar{y}_i)(x - \bar{x}_i)/S_i(x - \bar{x}_i)^2$$

If a pooled estimate of error is used,

$$Q = \Sigma\,S_i(y - \bar{y}_i)^2 - \Sigma\,b_i\,S_i(y - \bar{y}_i)(x - \bar{x}_i)$$
$$s^2 = Q/(n - 2t)$$
$$V(b_i) = s^2/S_i(x - \bar{x}_i)^2$$

If the variances from the regression lines in the different strata are likely
to be different, separate estimates of $s^2$ should be used when evaluating $V(b_i)$.

(c) Variable sampling fraction:

The method is similar to that outlined for the ratio method.

(d) Balanced sample:

As far as the estimation of error is concerned, a balanced sample must
be treated as if it were an unbalanced sample from which estimates have been
derived by the use of regression on the balanced variate. There will be no
regression adjustment to the estimates of the population values, since the
difference between the sample mean and the population mean of x will be zero
because of the balancing.


7.14 Calibration of eye estimates 


When a regression is used to calibrate eye estimates, as described in
Section 6.15, the sampling variance of $\bar{y}$ can be split into three parts, that due
to errors in b′, that due to the sampling variance of $\bar{x}_1$, arising from the main
sampling process, and that due to the variance of $\bar{x}_1 - \bar{x}$ arising from the
variance about the regression line.

The component of variance due to errors in b′ is usually sufficiently small
to be neglected. To a first approximation it equals

$$(\bar{x}_1 - \bar{x})^2\,V(b')/b'^4$$

where V(b′) is calculated in the ordinary manner from the regression.

The variance of $\bar{x}_1$ is calculated from the values of x for all the selected
units in the manner appropriate to the method of sampling adopted. The
contribution to $V(\bar{y})$ from this source is approximately

$$V(\bar{x}_1)/b'^2$$

A closer approximation is obtained by multiplying this variance by
$\{V(x) - V_r(x)\}/V(x)$, where V(x) is that part of the variance of x which
contributes to $V(\bar{x}_1)$ and $V_r(x)$ is the residual variance of x about the regression
line.

The variance of $\bar{x}_1 - \bar{x}$ due to variance about the regression line is
calculated from the residual variance $V_r(x)$ of x about this line. If $n_1$ and
n represent the numbers of units in the original sample and the sub-sample
for eye estimates, the contribution to $V(\bar{y})$ when all units are given equal
weight in the mean is

$$\frac{(n_1 - n)\,V_r(x)}{b'^2\,n\,n_1}$$

If the x's are weighted according to area a or other weights, the last expression
becomes

$$\frac{V_r(x)}{b'^2}\cdot\frac{1}{\{S_1(a)\}^2}\left[\frac{\{S'(a)\}^2\,S(a^2)}{\{S(a)\}^2} + S'(a^2)\right]$$

where $S_1$, S and S′ indicate summation over the whole sample, over the
sub-sample, and over the part of the sample not included in the sub-sample,
respectively.
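
The weighted and unweighted forms of this last component can be sketched together. The weighted expression coded below is a reconstruction that assumes independent residual errors of constant variance about the regression line; with the figures of Example 7.14 it gives the component of 0.4137 quoted there, and with unit weights it reduces to the equal-weight form. Function and argument names are hypothetical.

```python
def weighted_component(v_resid, b_prime, s_a, s_a2, sp_a, sp_a2):
    """Contribution to V(y-bar) from variance about the regression line, weighted case.

    s_a, s_a2   : sum of weights and of squared weights over the sub-sample (S)
    sp_a, sp_a2 : the same sums over the rest of the sample (S')
    """
    s1_a = s_a + sp_a                               # S_1(a), the whole sample
    bracket = sp_a ** 2 * s_a2 / s_a ** 2 + sp_a2
    return v_resid / b_prime ** 2 * bracket / s1_a ** 2

def equal_weight_component(v_resid, b_prime, n1, n):
    """The same contribution when all units carry equal weight."""
    return (n1 - n) * v_resid / (b_prime ** 2 * n * n1)

# Figures of Example 7.14
component = weighted_component(7.038, 0.6926, 610, 15172, 1279, 33899)
```

Setting every weight to unity makes S(a) = S(a²) = n and S′(a) = S′(a²) = n₁ - n, which is how the two functions are tied together.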


Example 7 .14 


Estimate the variance of the mean yield per acre obtained in Example 
6.15. 


The component of variance due to errors in b’ is negligible, since ¥, — ¥ 
is nearly zero. 

Since $\bar{x}_1$ was calculated by weighting the weighted mean eye estimates 
of the individual farms by the wheat acreages of these farms, the ratio method 
of calculating the sampling error at the first stage is applicable. A table was 
therefore prepared giving for each farm, (1) the total wheat acreage, (2) the 
weighted mean yield per acre based on the eye estimates of all the chosen 
fields on that farm, and (3) the product of these two numbers. Columns (1) 
and (3) constitute, in the ordinary ratio notation, the x and y values. Using 
these tabulated values, and the ordinary formula for the variance of a ratio, 
with the inclusion of the factor (1 − f), we find $V(\bar{x}_1) = 0.8200$,* and 
consequently the corresponding component of variance is $0.8200/0.6926^2$ or 
1.7095. The factor (1 − f) can properly be included here since for the 
majority of farms all the fields were taken. On the other hand, although 
the variance per field of the eye estimates is probably reasonably constant, the 
alternative approach outlined in Section 7.5, making use of this fact, would 
present difficulties, since the sampling units at the first stage are farms and 
not fields. The direct approach is therefore simpler. 

The residual variance about the regression line was found to be, from an 
analysis of the unweighted data for fields, $V_r(x) = 7.038$. The sums and 
sums of squares of the areas of the individual fields for which actual yields 
are and are not available, and of all fields, are 

\[ S(a) = 610 \qquad S'(a) = 1279 \qquad S_1(a) = 1889 \]
\[ S(a^2) = 15{,}172 \qquad S'(a^2) = 33{,}899 \qquad S_1(a^2) = 49{,}071 \]

Substitution in the formula above gives a component of variance of 0.4137. 
Fields and not farms can reasonably be used here, since errors in the eye 
estimates may be expected to be reasonably independent from field to field. 
This is not the case with $V(\bar{x}_1)$, since the yields of fields on the same farm 
often show considerable correlation. 

The standard error of the adjusted mean yield is therefore 
$\sqrt{1.7095 + 0.4137} = \pm 1.46$ bushels per acre. The main source of error 
is that due to sampling errors introduced by the variation in yields from farm 
to farm. The eye estimates are shown to be sufficiently consistent and to 
give adequate differentiation between differing yields. Regarded as an estimate 
of the mean yields of the fields actually sampled, the adjusted mean yield has 
a standard error of only $\sqrt{0.4137} = \pm 0.64$ bushels per acre. 
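
The arithmetic of this example can be retraced in a few lines of Python. This is a sketch only. It assumes that the weighted expression for the residual component takes the form V_r(x)/b′² × [{S′(a)/S(a)}² S(a²) + S′(a²)]/{S₁(a)}², a form which reproduces the quoted component of 0.4137. The function and variable names are illustrative and are not part of the original notation.

```python
import math

def residual_component(v_r, b, s_a, s_a2, sp_a, sp_a2):
    """Contribution of the residual variance to V(adjusted mean), weighted case.

    s_a,  s_a2  : sum and sum of squares of the weights over the sub-sample
    sp_a, sp_a2 : the same sums over the part of the sample not sub-sampled
    (assumed form of the weighted formula; see the lead-in above)
    """
    s1_a = s_a + sp_a                      # sum of weights over the whole sample
    bracket = (sp_a / s_a) ** 2 * s_a2 + sp_a2
    return v_r / b ** 2 * bracket / s1_a ** 2

# Figures quoted in Example 7.14
b_prime = 0.6926
first_stage = 0.8200 / b_prime ** 2        # V(x1_bar) / b'^2
residual = residual_component(7.038, b_prime, 610, 15_172, 1279, 33_899)

se_total = math.sqrt(first_stage + residual)
se_fields_only = math.sqrt(residual)

print(f"first-stage component = {first_stage:.4f}")
print(f"residual component    = {residual:.4f}")
print(f"S.E. adjusted mean    = {se_total:.2f} bushels per acre")
print(f"S.E. for sampled fields only = {se_fields_only:.2f} bushels per acre")
```

The two components are simply added because the errors of the eye estimates about the regression line are taken as independent of the farm-to-farm sampling errors.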


* The closer approximation gives the value 0.733. 


7.15 Sampling with probabilities proportional to size of unit 


In this case unbiased estimates of the sampling variances are those based 
on the mean square deviation of r. When units selected more than once are 
included the number of times they are selected, no correction for finite 
population is required. If $s_r^2$ is the estimated variance of r we then have, for 
a random sample, 

\[ s_r^2 = S(r - \bar{r})^2/(n - 1) \]
\[ V(\bar{r}) = \frac{1}{n}\, s_r^2 \]

If the size of the population is known we therefore have 

\[ V(\bar{r}) = s_r^2/n \]
\[ V(Y) = X^2 s_r^2/n \]


If the size of the population is not known, the sampling variance of the 
estimate of total size X is derived from the formulæ for a qualitative variate 
(Section 7.4). We have, for a random sample, following the notation of 
Section 6.16, 

\[ V(n/n_0) = \frac{n}{n_0}\left(1 - \frac{n}{n_0}\right)\Big/ n_0 \]

Hence, if A is known exactly, 

\[ V(X) = A^2\, \frac{n}{n_0}\left(1 - \frac{n}{n_0}\right)\Big/ n_0 = \frac{X^2}{n} \cdot \frac{n_0 - n}{n_0} \]
\[ V(Y) = X^2\, V(\bar{r}) + \bar{r}^2\, V(X) = \frac{X^2}{n}\left(s_r^2 + \bar{r}^2\, \frac{n_0 - n}{n_0}\right) \]

If A is not known exactly its estimation will contribute some slight 
additional variance, the amount of which depends on the precise method of 
location of the points. This, however, will in general be sufficiently small 
to be neglected. Substituting $n_0/d$ for A, we have 

\[ V(X) = \frac{n\,(n_0 - n)}{d^2\, n_0} \]


Example 7.15 
Estimate the sampling errors of the estimates of Examples 6.16.a and 
6.16.b, given that the standard deviation per field of the yield per acre is 
3.5 cwt. per acre, and that the distribution of points can be taken as random. 
The standard error of the mean yield per acre is 

\[ \text{S.E.}(\bar{r}) = 3.5/\sqrt{529} = \pm 0.152 \text{ cwt.} \]

For the estimates of Example 6.16.a, 

\[ V(X) = 2560^2 \times \frac{529 \times 7788}{8317} \text{ acres}^2 \]
\[ \text{S.E.}(X) = \pm 57{,}000 \text{ acres} \]
\[ V(Y) = \frac{1{,}354{,}000^2}{529}\left(3.5^2 + \frac{7788}{8317} \times 15.7^2\right) = 84.24 \times 10^{10} \text{ cwt.}^2 \]
\[ \text{S.E.}(Y) = \pm 45{,}900 \text{ tons} \]

For the estimates of Example 6.16.b, 

\[ V(X) = 640^2 \times \frac{2202 \times 31{,}053}{33{,}255} = 842.2 \times 10^6 \text{ acres}^2 \]
\[ \text{S.E.}(X) = \pm 29{,}000 \text{ acres} \]
\[ V(Y) = 1{,}409{,}000^2 \times 3.5^2/529 + 15.7^2 \times 842.2 \times 10^6 = 2536 \times 10^8 \text{ cwt.}^2 \]
\[ \text{S.E.}(Y) = \pm 25{,}200 \text{ tons} \]

Note that if the total area of crop were known accurately from other 
sources we should have 

\[ V(Y) = \frac{1{,}354{,}000^2}{529} \times 3.5^2 \text{ cwt.}^2 \]
\[ \text{S.E.}(Y) = \pm 10{,}300 \text{ tons} \]

If the acreage is not known a survey of this type will clearly be more efficient 
if sample harvesting is carried out at a proportion only of the sample points, 
the presence or absence of the crop being determined at the remaining points. 
This point is discussed in more detail in Section 8.17. 
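
The figures of Example 7.15 can be checked with the short sketch below. It assumes V(X) = n(n₀ − n)/(d²n₀), with 1/d the area represented by one point, and V(Y) = X²s_r²/n + r̄²V(X). The values 640 acres per point and 15.7 cwt. per acre are read from the worked figures of the example, and the function names are illustrative only.

```python
import math

CWT_PER_TON = 20  # 1 ton = 20 cwt.

def var_area(acres_per_point, n, n0):
    """V(X) = n (n0 - n) / (d^2 n0), where 1/d = acres represented by one point."""
    return acres_per_point ** 2 * n * (n0 - n) / n0

def var_total(x_total, s_r, n_harvested, r_bar, v_x):
    """V(Y) = X^2 s_r^2 / n + r_bar^2 V(X)."""
    return x_total ** 2 * s_r ** 2 / n_harvested + r_bar ** 2 * v_x

S_R, R_BAR, N_HARVESTED = 3.5, 15.7, 529

# Example 6.16.a: 529 of 8317 points fall in the crop, 2560 acres per point.
v_xa = var_area(2560, 529, 8317)
v_ya = var_total(1_354_000, S_R, N_HARVESTED, R_BAR, v_xa)

# Example 6.16.b: 2202 of 33,255 points fall in the crop, 640 acres per point.
v_xb = var_area(640, 2202, 33_255)
v_yb = var_total(1_409_000, S_R, N_HARVESTED, R_BAR, v_xb)

# Crop area known exactly: only the first term of V(Y) remains.
v_y_known = var_total(1_354_000, S_R, N_HARVESTED, R_BAR, 0.0)

for label, v_x, v_y in [("6.16.a", v_xa, v_ya), ("6.16.b", v_xb, v_yb)]:
    se_x = math.sqrt(v_x)
    se_y_tons = math.sqrt(v_y) / CWT_PER_TON
    print(f"{label}: S.E.(X) = {se_x:,.0f} acres, S.E.(Y) = {se_y_tons:,.0f} tons")
print(f"area known: S.E.(Y) = {math.sqrt(v_y_known) / CWT_PER_TON:,.0f} tons")
```

In both cases the term r̄²V(X), arising from the uncertainty in the crop area, is several times larger than the term arising from the yield determinations, which is why determining presence or absence of the crop at additional points is so effective.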

If the sample is such that it can be regarded as stratified by districts it 
might at first sight appear that the between-districts component should be 
eliminated from the variance of r. Unless the crop areas of the different 
districts are accurately known, however, this must not be done, since the 
proportion of sampling points falling in the crop in a district will not be 
accurately proportional to the area of the crop in that district. 

If the sample points are confined to some localities only by a two-stage 
sampling process, with localities as first-stage sampling units, the above variances 
will represent the second-stage components of variance only. The full 
sampling errors must be determined from the first-stage units as explained 
in Section 7.17. 


7.16 Sampling from within strata with probabilities proportional to 
size of unit 


Since the number of units within each stratum is in general small, a pooled 
estimate $s_r^2$ of the within-strata variance of r based on the mean square 
deviations of r within strata will be required. Consequently 

\[ s_r^2 = \frac{\sum S_i (r - \bar{r}_i)^2}{n - t} \]

We then have 

\[ V(\bar{r}_i) = s_r^2 (1 - f_i)/n_i \]

and thus 

\[ V(Y) = s_r^2 \sum \{X_i^2 (1 - f_i)/n_i\} \]
\[ V(\bar{r}) = V(Y)/X^2 \]


If the variation in the within-strata variances is large it may be necessary 
to introduce weights when forming the pooled estimate of error, in order to 
avoid bias. 

The correction for finite sampling is not required if the units selected 
more than once are included the number of times they are selected. In the 
more usual case in which each unit is included once only, additional units 
being selected by the method of Section 3.10, the correction for finite sampling 
should be included. In this case, since probability of selection is not strictly 
proportional to size, the formulæ are approximate only. 


Example 7.16 


Estimate the sampling errors of the estimate of Example 6.17. 

The analysis of variance of the values of the ratio for the individual 
“combined” parishes is given in Table 7.16. It follows the lines of 
Example 7.7. 


TABLE 7.16—ANALYSIS OF VARIANCE OF THE VALUES OF THE RATIO 


                       Degrees of     Sum of       Mean 
                       freedom        squares      square 

Between districts          6          0.04952     0.008253 
Within districts          10          0.02649     0.002649 
TOTAL                     16          0.07601     0.004751 


This gives a value of $s_r^2$ of 0.002649. We then have 

\[ \sum \{X_i^2 (1 - f_i)/n_i\} = 22{,}932^2\,(1 - f_1)/n_1 + 43{,}591^2\,(1 - f_2)/n_2 + \ldots = 37.188 \times 10^8 \]
\[ \text{S.E.}(Y) = \sqrt{0.002649 \times 37.188 \times 10^8} = \pm 3140 \]


It will be noted that the variance of r within districts is less than the 
overall variance. Consequently stratification by districts appreciably reduces 
the sampling error. 
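
As a check on the figures of this example, the following sketch forms the pooled within-districts variance from Table 7.16 and the resulting standard error. The weighted sum 37.188 × 10⁸ is taken as quoted; the variable names are illustrative.

```python
import math

# Sums of squares and degrees of freedom from Table 7.16.
within_ss, within_df = 0.02649, 10     # within districts: n - t = 17 - 7
total_ss, total_df = 0.07601, 16

s_r2 = within_ss / within_df           # pooled within-strata variance of r
overall_ms = total_ss / total_df       # variance of r ignoring the strata

weight_sum = 37.188e8                  # sum of X_i^2 (1 - f_i) / n_i, as quoted
se_y = math.sqrt(s_r2 * weight_sum)

print(f"s_r^2 = {s_r2:.6f}, S.E.(Y) = {se_y:.0f}")
print(f"within-districts / overall mean square = {s_r2 / overall_ms:.2f}")
```

The ratio of the two mean squares gives a rough measure of the reduction in variance obtained by stratifying by districts.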


7.17 Multi-stage sampling 


If the sampling fraction at the first stage is small, the total sampling error 
of multi-stage sampling is obtained from the first-stage unit values, estimating 
each unit value from the results of the sampling at the second and following 
stages, and using the method of estimation appropriate to the method of sampling 
at the first stage. The additional variability contributed by the second and 
following stages is automatically included in this estimate of error. 

If, on the other hand, the sampling fraction f’ at the first stage is not 
sufficiently small for the factor (1 —f’) to be neglected, the sampling error 
will be increased on account of the fact that the selected first-stage unit values 
are themselves subject to sampling error, instead of being known exactly, as 
in single-stage sampling. The increase in the sampling variance of whatever 
estimate is under consideration will be equal to f’ times the variance in this 
estimate resulting from the sampling at the second and following stages. This 
variance can be calculated by regarding the selected first-stage units as strata 
which are sampled by the sampling at the second and following stages. 

Thus, for example, in two-stage random sampling, with n′ selected first-stage 
units, and n″ second-stage units selected from each first-stage unit, if $s'^2$ is 
the estimate of the sampling variance of the first-stage unit means, i.e. the 
means of the second-stage values for the separate selected first-stage units, 
$s''^2$ is the variance of the second-stage units about the first-stage unit means, 
and f′ and f″ are the first- and second-stage sampling fractions, the sampling 
variance of the mean of the selected first-stage units will be $(1 - f'')\,s''^2/n'n''$, 
since this mean is based on n′n″ second-stage units, and therefore the sampling 
variance of the mean of the population is 

\[ V = \frac{1 - f'}{n'}\, s'^2 + f'\, \frac{1 - f''}{n'n''}\, s''^2 \]
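
The two-stage rule described above may be sketched as a small function. It assumes the variance is the first-stage term (1 − f′)s′²/n′ plus f′ times the second-stage variance (1 − f″)s″²/n′n″. The numerical inputs in the usage lines are hypothetical and serve only to show the relative size of the two terms.

```python
def two_stage_variance(s1_sq, s2_sq, n1, n2, f1, f2):
    """Sampling variance of the population mean in two-stage random sampling.

    s1_sq : estimated variance of the first-stage unit means
    s2_sq : variance of second-stage units about the first-stage unit means
    n1    : number of selected first-stage units
    n2    : number of second-stage units selected per first-stage unit
    f1, f2: first- and second-stage sampling fractions
    """
    first_stage_term = (1 - f1) * s1_sq / n1
    # Extra variance because the selected unit values are themselves estimates:
    # f1 times the variance of the mean of the selected first-stage units.
    second_stage_term = f1 * (1 - f2) * s2_sq / (n1 * n2)
    return first_stage_term + second_stage_term

# Hypothetical figures: 10 first-stage units (one in five), 5 sub-units each.
v = two_stage_variance(s1_sq=4.0, s2_sq=9.0, n1=10, n2=5, f1=0.2, f2=0.1)
v_small_f1 = two_stage_variance(s1_sq=4.0, s2_sq=9.0, n1=10, n2=5, f1=0.0, f2=0.1)
print(v, v_small_f1)
```

When f′ is negligible the second term vanishes and the estimate reduces to s′²/n′, which is the case treated in the first paragraph of this section.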


Example 7.17 


Calculate the sampling errors of the estimates of the mean dressing of 
nitrogen per acre obtained in Example 6.19. 


The y’s and x’s entering into the ratio method estimate at the first stage 
are the successive terms of the expressions for S (g″y) and S (g″x), already 
given for small farms, namely 1-36, 3-78, $30; « « « and 2, 6, 6, 
respectively. 

The calculation of the sampling error follows the lines indicated in 
Section 7.10, the ratio (cwt. nitrogen per acre) being assumed, for the reason 
given at the end of Section 7.9, to be the same for all strata. 

The values of $s_{qi}^2$ are found to be 


Small farms : 27.519/21 = 1.3104 
Medium farms: 583.81/35 = 16.680 
Large farms : 2779.1/8 = 347.39 


We then have, neglecting the factors $(1 - f_i)$, 

\[ V(\bar{r}) = (105^2 \times 22 \times 1.3104 + 59^2 \times 36 \times 16.680 + 30^2 \times 9 \times 347.39)/58{,}229^2 = (\pm 0.0392)^2 \]




The sampling fractions are here all small and the variance at the second 
stage therefore need not be considered. With the present material this variance 
could not in any case be estimated since only one field per farm was selected. 
In such cases, when the $f_i$ are not small, it is still best to neglect them. The 
sampling error will then be slightly overestimated. 

Farms without sugar beet have been excluded from the above calculation. 
Their inclusion, though substantially decreasing the values of $s_{qi}^2$, would 
make little difference to the final estimate of error, since the values of $n_i$ in 
the formula for V (Y) would be correspondingly increased. 

From inspection of the data, and from the nature of the material, we may 
expect that the variance of the mean dressing per acre r will be substantially 
constant, irrespective of the size of field to which it is applied. The alternative 
procedure of calculating the variance of r directly without any weighting may 
therefore be followed without serious risk of introducing any marked bias 
into the estimate of error. 


TABLE 7.17—ANALYSIS OF VARIANCE OF DRESSINGS OF NITROGEN PER ACRE 


                  Degrees of     Sum of      Mean 
                  freedom        squares     square 

Small farms          21          1.1304      0.0538 
Medium farms         35          1.3500      0.0386 
Large farms           8          0.9836      0.1230 
TOTAL                64          3.4640      0.0541 


The within-size-groups sums of squares and mean squares of r are shown 
in Table 7.17. There is no marked difference between the different size-groups, 
and in the following calculations we will therefore use the pooled estimate of 
the mean square, $s_r^2 = 0.0541$. 

From the formulae of Section 7.5 we have, for a single stratum, 

\[ V(\bar{r}_i) = s_{ri}^2 (1 - f_i)\, S_i(x^2)/\{S_i(x)\}^2 \]
\[ V(Y_i) = s_{ri}^2 (1 - f_i)\, S_i(x^2) \]

where the x’s are those entering into the first-stage sampling, i.e. g″x in the 
full notation. The sums of squares of g″x will be found to be 580, 15,245 
and 39,993 for small, medium and large farms respectively. Consequently, 
summing over all strata as before, omitting the $(1 - f_i)$, and taking the variable 
sampling fraction into account, we have 

\[ V(\bar{r}) = 0.0541\,(105^2 \times 580 + 59^2 \times 15{,}245 + 30^2 \times 39{,}993)/58{,}229^2 = (\pm 0.0391)^2 \]


This is almost identical with the value previously obtained. 
The comparison between the two methods of calculating the variance may 
be taken a stage further by estimating the values of $s_{ri}^2$ from those of $s_{qi}^2$ and 
comparing them with those obtained directly. Equating the two expressions 
for $V(\bar{r}_i)$, we find $s_{ri}^2 = n_i\, s_{qi}^2/S_i\{(g''x)^2\}$. Using the values of $s_{qi}^2$ already given 
we obtain for $s_{ri}^2$ the values 0.0497, 0.0394, 0.0782 respectively. These show 
no consistent divergence from the values of Table 7.17, and we may therefore 
conclude that the bias in the estimation of error by the second method is likely 
to be small. A more thorough investigation could be made by tabulating a 
number of comparisons of the above type from various batches of similar 
data. 
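
The comparison of the two methods can be retraced as follows. This is a sketch using only the figures quoted in the example. The multipliers 105, 59 and 30 and the divisor 58,229 are taken as they appear in the worked expressions, and the relation s_ri² = n_i s_qi²/S_i{(g″x)²} is assumed for the implied values; all names are illustrative.

```python
import math

# Figures quoted in the example, one entry per size-group of farms
# (small, medium, large).
multiplier = [105, 59, 30]             # stratum multipliers used in the text
n_farms = [22, 36, 9]
s_q2 = [1.3104, 16.680, 347.39]        # ratio-method variances
ss_gx = [580, 15_245, 39_993]          # sums of squares of g''x
x_total = 58_229
s_r2_pooled = 0.0541                   # pooled mean square from Table 7.17

# Method 1: ratio method applied to the first-stage units.
v_ratio = sum(m ** 2 * n * s for m, n, s in zip(multiplier, n_farms, s_q2)) / x_total ** 2

# Method 2: variance of r taken as constant, using the pooled estimate.
v_direct = s_r2_pooled * sum(m ** 2 * ss for m, ss in zip(multiplier, ss_gx)) / x_total ** 2

# Values of s_ri^2 implied by the ratio-method variances.
s_r2_implied = [n * s / ss for n, s, ss in zip(n_farms, s_q2, ss_gx)]

print(f"S.E. by ratio method  = {math.sqrt(v_ratio):.4f}")
print(f"S.E. by direct method = {math.sqrt(v_direct):.4f}")
print("implied s_ri^2:", [round(v, 4) for v in s_r2_implied])
```

The direct method comes out at about 0.0390 when computed from the rounded inputs, which agrees with the quoted 0.0391 to within rounding.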

The value of $s_r^2$ given by Table 7.17 is directly appropriate for the calculation 
of the variance of the estimate from the unweighted means (Table 6.19.c). 

The sampling standard error of the mean dressing over all fields is 
$\sqrt{0.0541/67} = \pm 0.0284$. This standard error does not include any errors 
due to bias, but will be appropriate, or at least approximately so, to comparisons 
of such a nature that the major part of the bias is eliminated. If there were 
large differences between size-groups the question of whether the pooled 
within-size-groups variance or the overall variance is appropriate to the 
comparison in question would have to be considered; this, however, involves 
other problems, such as how far the differences observed are due to differences 
in size-group proportions (see Examples 7.5 and 7.7.b). 

It will be noted that the standard error of the properly weighted ratio 
estimate is considerably greater than the standard error of the straight mean. 
The ratio of the squares is $0.0392^2/0.0284^2 = 1.91$. Thus about double the 
number of farms, excluding those without sugar beet, are required to attain 
the same accuracy when unbiased estimates are required. This is inevitable 
in a survey of this kind where the sampling fractions cannot be adjusted so as 
to be proportional to the areas of the crop being sampled, a course which is 
impossible when a number of crops are covered in the same survey, even if 
the necessary information is available. 


7.18 Systematic samples 


No fully valid estimate of the sampling error of a systematic sample is 
possible, since the units are not located at random within defined strata. 
Approximate estimates can be made in various ways. The simplest, which 
will suffice for most census and survey work, is to divide the material arbitrarily 
into strata, and calculate the sampling error as if the units were selected at 
random from these strata. 

In the case of a systematic sample from a list it will usually be sufficient to 
take account of the major groupings of the list, treating these as strata, and 
to ignore any minor and ill-defined groupings. An example of this has already 
been given (Example 7.7.a). 

In the case of one-dimensional systematic sampling, e.g. equally spaced 
points on a line, or equally spaced lines covering an area, the strata may be 
taken to contain pairs of successive units, so that the error variance is estimated 
from the differences between the members of the pairs. Each difference 
contributes one degree of freedom. If there are n′ such differences d, the error 
variance per unit is therefore 

\[ s^2 = \tfrac{1}{2}\, S(d^2)/n' \]

Since the pairing is arbitrary, instead of taking alternate differences between 
successive units all differences may be taken. This is equivalent to taking two 
sets of overlapping strata. The accuracy of the estimate of $s^2$ is thereby somewhat 
increased, though it is not doubled, owing to lack of independence between 
the successive differences. 

In the case of two-dimensional sampling on a square or rectangular pattern 
the strata should consist of sets of four units in a 2 × 2 pattern. By this means 
variability in both directions will be taken into account. There is no point 
in taking overlapping strata. Since each such stratum contributes 3 degrees 
of freedom the formula for the error variance per unit is 

\[ s^2 = \left[ S(y^2) - \tfrac{1}{4} \sum \{S_i(y)\}^2 \right] \Big/ 3n' \]

where n′ is the number of strata. 

In the case of line sampling a complication arises if all the lines are not of 
approximately equal length. If the total area covered by the sample is known, 
the most accurate estimate of the quantity under consideration will be obtained 
by the ratio method. In this case the calculation of the sampling error should 
strictly follow the method given in Section 7.9 for a stratified sample estimated 
by means of a constant ratio. With strata of two units the formula for Q 
becomes 

\[ Q = \tfrac{1}{2}\, S(d_y^2) - 2\bar{r} \cdot \tfrac{1}{2}\, S(d_x d_y) + \bar{r}^2 \cdot \tfrac{1}{2}\, S(d_x^2) \]

This will eliminate the variability due to variation in length of line. If the 
total area is not known, so that the final estimate is obtained by multiplication 
of the total over all the lines by the appropriate raising factor, the difference 
method given above, and not the ratio method, must be used. 

These methods of estimation of the sampling error are also applicable to 
line samples in which the lines are randomly located in pairs within blocks 
and thus form a proper stratified random sample. In this case the estimate 
of error will be fully valid. 

In either systematic or stratified random line sampling, the variation in 
the length between neighbouring lines will not be large unless the boundaries 
of the area covered are very irregular. Consequently the approximate method 
based on the direct differences can be used in most cases without serious 
inaccuracy. 

The above methods of estimation of error for systematic samples will give 
overestimates of the sampling error, provided there are no periodic features in 
the material, and provided in two-dimensional sampling that there are no 
marked strip effects running in straight lines across the material in such a 
manner that the whole of one line of sample points falls on the same strip. 
If a closer estimate is required, an alternative, but rather more complicated, 
procedure is available. In one-dimensional sampling, instead of taking successive 
differences, differences of the type 

\[ d = \tfrac{1}{2} y_1 - y_2 + y_3 - y_4 + y_5 - y_6 + y_7 - y_8 + \tfrac{1}{2} y_9 \]

can be taken. Such differences may be called balanced differences. Most of 
the systematic component of variation is thus eliminated. The number of 
terms included in each difference is to a certain extent arbitrary, but 9 is 
chosen as a convenient compromise. With extensive material there will be 
no need to take overlapping differences, the best procedure being to have 
overlap of the end terms only, so that the $y_9$ of the first difference is taken as 
the $y_1$ of the second. With this convention the sum of all the differences is 
equal to one-half the first and last included terms plus the sum of all the 
remaining odd terms minus the sum of all the even terms. The square of each 
difference contributes one degree of freedom, the divisor being given by the 
sum of the squares of the coefficients, i.e. 7.5. Consequently $s^2 = S(d^2)/7.5\,n'$. 
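
The balanced-difference procedure for the one-dimensional case can be sketched as follows. The function name is illustrative, and the rule for discarding any terms left over at the end of the series is an assumption of this sketch rather than part of the method as described.

```python
def balanced_difference_variance(y):
    """Error variance per unit from balanced differences of nine terms.

    Successive differences overlap in their end terms only, so the ninth
    term of one difference is the first term of the next.  Terms left over
    at the end of the series are ignored (an assumption of this sketch).
    """
    coeffs = [0.5, -1, 1, -1, 1, -1, 1, -1, 0.5]
    divisor = sum(c * c for c in coeffs)          # 7.5
    diffs = []
    start = 0
    while start + len(coeffs) <= len(y):
        window = y[start:start + len(coeffs)]
        diffs.append(sum(c * v for c, v in zip(coeffs, window)))
        start += len(coeffs) - 1                  # overlap of end terms only
    if not diffs:
        raise ValueError("at least nine observations are required")
    return sum(d * d for d in diffs) / (divisor * len(diffs))

# A purely linear trend is eliminated entirely by the balanced differences.
trend = [2.0 + 0.3 * i for i in range(17)]
print(balanced_difference_variance(trend))
```

Since the coefficients sum to zero and are symmetrical about the middle term, any constant level and any linear trend within the nine terms contribute nothing to d, which is the sense in which most of the systematic component is eliminated.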

[FIG. 7.18—COEFFICIENTS FOR CALCULATING THE ERROR OF A SYSTEMATIC 
TWO-DIMENSIONAL SAMPLE (figure not reproduced)] 


A similar procedure can be followed in the case of two-dimensional 
systematic sampling, the most convenient type of difference being that given 
by the coefficients shown in Fig. 7.18. Here again, the margins of the square 
covering one difference may be taken as the margins of neighbouring squares. 
The divisor in this case will be 6¼. 

The estimates provided by balanced differences will also in general be 
overestimates of the sampling error, but may be expected to be closer than 
those based on ordinary differences. If there is no wide discrepancy between 
the two types of estimate it may be concluded that the degree of overestimation 
is not likely to be great. More exact estimates can only be obtained by taking 
supplementary observations at intermediate points allocated either at random 
or systematically. The one-dimensional case has been discussed in detail 
by Yates (1948, A). 

The above methods of estimation can be applied both to quantitative and 
qualitative data, but in the case of qualitative data, based on either one- or 
two-dimensional point sampling, a rapid estimate of the sampling error can be 
made by using the formulae for a random sample, as in Example 7.15. This 
will tend to give greater overestimation of the sampling error than the above 
methods, but if the parts of the line or area possessing the attribute are small and 
irregularly distributed, with no great variation in density in different parts of the 
line or area, the estimate will be sufficiently good for most practical purposes. 


Example 7.18 


In the 1942 Census of Woodlands the total area of woodland shown on 
the maps was determined for each county by estimating the area of land 
coloured green on the 1-inch O.S. maps. This was done by measuring the 
total length of the E-W kilometre grid lines which fell in green areas. The 
results for O.S. sheet No. 115 covering part of Kent are given in Table 7.18. 
Estimate the sampling error of this process. 


TABLE 7.18—WOODLAND AREAS FROM LINE INTERCEPTS (cm.) 


Grid   Length     Length      Successive      Grid   Length     Length      Successive 
line   of line,   coloured    differences,    line   of line,   coloured    differences, 
       x          green, y    dy                     x          green, y    dy 

 98      3.5        0.0                        83     30.0        3.8         +1.4 
 97      4.2        0.9         +0.9           82     29.4        4.1         +0.3 
 96      9.2        0.0         -0.9           81     29.1        4.9         +0.8 
 95     12.6        0.0          0.0           80     28.8        6.0         +1.1 
 94     15.5        0.3         +0.3           79     28.6        5.4         -0.6 
 93     21.2        0.1         -0.2           78     28.2        2.3         -3.1 
 92     25.2        0.5         +0.4           77     27.2        2.9         +0.6 
 91     25.4        3.1         +2.6           76     26.3        2.1         -0.8 
 90     31.2        2.8         -0.3           75     25.4        6.3         +4.2 
 89     34.2        2.7         -0.1           74      …          8.2         +1.9 
 88     34.1        2.8         +0.1           73      …          5.4         -2.8 
 87     33.0        2.6         -0.2           72     24.9        6.6         +1.2 
 86      …          2.3         -0.3           71     24.6        6.6          0.0 
 85      …          3.5         +1.2           70      …          4.1         -2.5 
 84     30.7        2.4         -1.1          Total  716.4       92.7 


The successive differences dy of the lengths coloured green, y, are shown 
in the fourth and eighth columns. We find $S(d_y^2) = 63.21$ and hence, 
since there are 29 lines, 

\[ s^2 = \tfrac{1}{2} \times 63.21/28 = 1.1288 \]
\[ \text{S.E.}\{S(y)\} = \sqrt{29\, s^2} = \pm 5.72 \]

A length corresponding to 1 km. represents an area of 1 sq. km. 
Consequently the raising factor to be applied to the total length measured in 
cm. to give the estimated area in acres is 63,360 × 247.11/100,000 = 156.57. 

The total area of woodland is therefore 


\[ 156.57 \times S(y) = 156.57 \times 92.7 = 14{,}514 \text{ acres} \]

and the standard error of this area is 156.57 × 5.72 = ±896 acres, i.e. 
6.2 per cent. 

The same procedure can be followed for the other maps covering the county, 
and the square root of the sum of the squares of the resultant standard errors 
will give the standard error for the whole county, since the errors of the different 
maps are virtually independent. The results for Kent gave a percentage 
standard error of 3.4 per cent. 

If the ratio method is used the corresponding successive differences dx of 
the total lengths of the grid lines, and the sums of squares and products $S(d_x^2)$ 
and $S(d_x d_y)$ are required. The latter are found to be 159.23 and + 2.62 
respectively for the map in question. Using the ratio $\bar{r} = 0.12940$ derived 
from this map, we find 

\[ s_q^2 = Q/28 = 1.1643 \]


As expected, there is no appreciable difference in the error calculated by 
the two methods. The simpler method is consequently all that is really 
required, even when the total area of woodland is calculated from the total 
area of the county and the ratio of the length coloured green to the total length 
of the grid lines. If, however, the first grid line is much shorter than the rest, 
owing to its cutting the map boundary at a small angle, it should be omitted, 
or the length made up by taking the relevant part of the line on the neighbouring 
map. This trouble will only arise if the error is estimated separately for each 
map and the grid lines are not exactly parallel to the map boundary. 

On the other hand, the calculation of the error from the total variance of 
y, i.e. without stratification, would give very misleading results. 
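
The calculations of Example 7.18 can be retraced from the lengths coloured green. In the sketch below the values for grid lines 86 and 85 are taken as 2.3 and 3.5; these two entries are an assumption, being the values that agree with the stated total S(y) = 92.7 and with S(dy²) = 63.21. The sums S(dx²) and S(dx dy) are used as quoted, and the variable names are illustrative.

```python
import math

# Lengths coloured green (cm.) for grid lines 98 down to 70.
# The 13th and 14th entries (lines 86 and 85) are inferred from the totals.
y = [0.0, 0.9, 0.0, 0.0, 0.3, 0.1, 0.5, 3.1, 2.8, 2.7, 2.8, 2.6, 2.3, 3.5, 2.4,
     3.8, 4.1, 4.9, 6.0, 5.4, 2.3, 2.9, 2.1, 6.3, 8.2, 5.4, 6.6, 6.6, 4.1]

d_y = [b - a for a, b in zip(y, y[1:])]        # all 28 successive differences
ss_dy = sum(d * d for d in d_y)

s2 = 0.5 * ss_dy / len(d_y)                    # error variance per line
se_sum = math.sqrt(len(y) * s2)                # S.E. of the total length S(y)

raising_factor = 63_360 * 247.11 / 100_000     # cm. of line -> acres
area = raising_factor * sum(y)
se_area = raising_factor * se_sum

# Ratio method, with the sums of squares and products quoted in the text.
r_bar, ss_dx, sp_dxdy = 0.12940, 159.23, 2.62
q = 0.5 * ss_dy - 2 * r_bar * 0.5 * sp_dxdy + r_bar ** 2 * 0.5 * ss_dx
s_q2 = q / 28

print(f"S(dy^2) = {ss_dy:.2f}, s^2 = {s2:.4f}, S.E.(S(y)) = {se_sum:.2f}")
print(f"area = {area:,.0f} acres, S.E. = {se_area:,.0f} acres "
      f"({100 * se_area / area:.1f} per cent)")
print(f"ratio method: s_q^2 = {s_q2:.4f}")
```

The closeness of the two variances, 1.1288 and 1.1643, is what justifies the use of the simpler difference method here.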


7.19 Sampling on successive occasions 


It will be sufficient if we record the variances of the estimates given in 
Sections 6.21 and 6.22. 


(a) Two occasions only: sub-sample on the second occasion. 

\[ V(\bar{y}_w) = \{V(y) - \mu b^2 V(x)\}/\lambda n = (1 - \mu r^2)\, V(y)/\lambda n \]
\[ V(\bar{y}_w - \bar{x}) = \{V(y) + (\lambda - 2\lambda b - \mu b^2)\, V(x)\}/\lambda n \]

If the population value on the second occasion is estimated by adding the 
estimate of change derived from the sub-sample only to the overall mean on 
the first occasion, i.e. by $\bar{y}' - \bar{x}' + \bar{x}$, the variance of this estimate will be 

\[ V(\bar{y}' - \bar{x}' + \bar{x}) = \{V(y) - \mu (2b - 1)\, V(x)\}/\lambda n \]

(b) Two occasions only: part of the sample replaced on the second occasion. 

\[ V(\bar{y}_w) = \frac{(1 - \mu r^2)\, V(y)}{n\,(1 - \mu^2 r^2)} \]

or, in the case of unequal numbers on the two occasions, 

\[ V(\bar{y}_w) = \frac{(1 - \mu r^2)\, V(y)}{n' + n''(1 - \mu r^2)} \]

With equal numbers on the two occasions, the variance of the estimate 
of change given by formula 6.21.b is approximately 

\[ V(\text{change}) = \frac{(1 - r)\{V(y) + V(x)\}}{n\,(1 - \mu r)} \]

The variance of the estimate given by the difference of the means of the 
units occurring on both occasions is approximately 

\[ V(\bar{y}' - \bar{x}') = (1 - r)\{V(y) + V(x)\}/\lambda n \]

and of that given by the difference of the overall means is approximately 

\[ V(\bar{y} - \bar{x}) = (1 - \lambda r)\{V(y) + V(x)\}/n \]

The exact expressions in the last two cases are given by replacing r by 

\[ 2\, \text{cov}(xy)/\{V(x) + V(y)\} \]

which is equal to r when V (x) = V (y). 


(c) Successive occasions: same fraction replaced on each occasion. 


The limiting value of $V(\bar{y}_h)$, subject to the restrictions mentioned in 
Section 6.22, is as follows: 

\[ V(\bar{y}_h) = \varphi\, V(y)/\mu n \]

The variance of the estimate of change given by $\bar{y}_h - \bar{y}_{h-1}$ is 

\[ V(\bar{y}_h - \bar{y}_{h-1}) = 2\varphi\, V(y)\{1 - r(1 - \varphi)\}/\mu n \]

Example 7.19.a 


Estimate the sampling errors of the estimates of Example 6.21. 


V (x) and V (y) may reasonably be taken as equal. The pooled estimate, 
based on all the observations on each occasion (22 degrees of freedom) is 
0.08767. 

We then have 

\[ V(\bar{y}_w) = \frac{0.08767\,(1 - \tfrac{1}{3} \times 0.847^2)}{12\,(1 - \tfrac{1}{9} \times 0.847^2)} = 0.0777^2 \]

This may be compared with the variance of the overall mean, 
0.08767/12, i.e. $0.0855^2$. The ratio of these variances is 1.21, and the 
gain in efficiency by the use of the information provided by the sampling on 
the first occasion is 21 per cent. 
Similarly 

\[ V(\text{change}) = \frac{2 \times 0.08767\,(1 - 0.847)}{12\,(1 - \tfrac{1}{3} \times 0.847)} = 0.0558^2 \]

The variance of the change estimated from the units sampled on both occasions 
is 2 × 0.08767 (1 − 0.847)/8, i.e. $0.0579^2$. The ratio of these variances is 
1.077, and the gain is therefore 8 per cent. If the change is estimated from 
the difference of the overall means the variance is 

\[ V(\bar{y} - \bar{x}) = 2 \times 0.08767\,(1 - \tfrac{2}{3} \times 0.847)/12 = 0.0798^2 \]

The ratio of this variance to the first variance is 2.042, and the gain is therefore 
104 per cent. 
It will be noted that when V(x) is taken as equal to V(y) all the above 
gains depend solely on the values of λ and r. 
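
The gains quoted in Example 7.19.a can be retraced with the sketch below. It assumes the formulae of (b) with V(x) = V(y), n = 12 units on each occasion, and one-third of the sample replaced (μ = 1/3, λ = 2/3, so that 8 units are common to both occasions). These fractions are the values that reproduce the quoted results, and the function names are illustrative.

```python
import math

def var_weighted_mean(v_y, r, mu, n):
    """V(y_w) = (1 - mu r^2) V(y) / {n (1 - mu^2 r^2)}."""
    return (1 - mu * r * r) * v_y / (n * (1 - mu * mu * r * r))

def var_change_weighted(v_y, v_x, r, mu, n):
    """Approximate variance of the best estimate of change."""
    return (1 - r) * (v_y + v_x) / (n * (1 - mu * r))

v, r, n = 0.08767, 0.847, 12
lam = 2 / 3          # fraction of units common to both occasions (assumed)
mu = 1 - lam         # fraction replaced

v_w = var_weighted_mean(v, r, mu, n)
v_overall = v / n
v_change = var_change_weighted(v, v, r, mu, n)
v_change_matched = (1 - r) * 2 * v / (lam * n)      # common units only
v_change_overall = (1 - lam * r) * 2 * v / n        # difference of overall means

print(f"S.E.(y_w) = {math.sqrt(v_w):.4f}, gain = {v_overall / v_w - 1:.0%}")
print(f"S.E.(change) = {math.sqrt(v_change):.4f}")
print(f"gain over common units only  = {v_change_matched / v_change - 1:.0%}")
print(f"gain over overall difference = {v_change_overall / v_change - 1:.0%}")
```

Since V(y) cancels from each ratio, the gains are functions of λ and r alone, as noted at the end of the example.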


Example 7.19.b 

Estimate the sampling errors of the estimates of Example 6.22. 

Owing to variation in the numbers on the different occasions the above 
formulae will only give approximate estimates of the sampling errors. Excluding 
January, the average number of observations per occasion is 9.8 and the average 
value of λ is 0.664. Since r = 0.811, $1 - r^2 = 0.343$, and hence 

\[ \varphi = \frac{-0.343 + \sqrt{0.343\,(1 - 0.657 \times 0.108)}}{2 \times 0.664 \times 0.657} = 0.254 \]

V (y) was found to be 0.0871, and hence 

\[ V(\bar{y}_h) = \frac{0.254 \times 0.0871}{0.336 \times 9.8} = 0.0820^2 \]
\[ V(\bar{y}_h - \bar{y}_{h-1}) = \frac{2 \times 0.254 \times 0.0871\,(1 - 0.811 \times 0.746)}{0.336 \times 9.8} = 0.0729^2 \]

7.20 The error graph 


In various instances in the preceding sections pooled estimates of the error 
variance have been used. Such pooled estimates are only legitimate if the 
error variance is reasonably constant over the parts of the population for which 
the pooling is carried out. In many types of material such constancy does not 
exist, and in such cases, when the number of degrees of freedom is too small 
for accurate determination of the error variances of the different parts of the 
population, other procedures must be followed if sampling errors are required 
separately for the different parts. 

The simplest and most convenient device for practical use is the error 
graph. The estimates of the error variance are plotted against some other 
characteristic of the parts of the sample which is believed to govern the 
magnitude of the error variance, and a smooth curve is drawn to fit the points 
as closely as possible. This curve gives the variance law from which revised 
estimates of the variance can be obtained for any given value of the determining 
characteristic. 

Fig. 7.20 shows a graph of this kind obtained in the course of a survey 
of wireworm infestation in grass fields. In each field 20 cores were taken and 
the wireworms counted in each core. There are thus 19 degrees of freedom 
for the determination of sampling error in each field. The estimates of the 
percentage variance so obtained were plotted against the estimated number 
of wireworms per acre in the various fields. The smooth curve so obtained 
was used to provide a table of errors that might be expected in similar sampling. 
Table 7.20 gives a small abstract of this table, and also of the inverse table, 
obtained by interpolation from the first table, giving the fiducial limits associated 
with any given observed number of wireworms. This procedure is approximate 
in a number of respects, but detailed discussion would be out of place here. 


TABLE 7.20—DISTRIBUTION AND PROBABLE LIMITS OF ERROR OF SAMPLE 
ESTIMATES OF WIREWORM POPULATIONS OF GRASS FIELDS SAMPLED BY TWENTY 
4 IN. CORES (1 core = 1/500,000 acre) 


(1,000 per acre) 


               One-eighth of sample                   Population which, in one-eighth 
               estimates                              of cases, would give an estimate 
True           less        greater      Estimated     not less than     not greater than 
population     than        than         population    that observed     that observed 

    200         105          295            200            128               325 
    400         260          540            400            284               567 
    600         428          772            600            451               804 
    800         597        1,003            800            624             1,040 
  1,000         766        1,234          1,000            797             1,277 

There are, of course, various other methods of dealing with problems 
of this type. In biological work the original variates are often transformed 
to other variates, such as logarithms or square roots, which may in the material 
in question be expected to have a more constant variance, and thus permit 
pooling of the estimates of error. Such procedures introduce a number of 
complications; in particular the means of the transformed variates, when 
transformed back into the original variates, will be biased. They are not 
generally necessary or advisable in sample censuses and surveys. 

[FIG. 7.20—STANDARD ERRORS PER UNIT CORE OF 4 IN. DIAM. 
(WIREWORM SURVEY, 1940-1). Horizontal axis: population, 1,000 per acre (0 to 1,250). 
Plotted points: means for 2272 fields, grass in 1940, and means for 525 fields, 
arable in 1940. Full curve: fitted to data from grass fields. Broken curve: Poisson 
distribution. Figure not reproduced.] 

Reproduced from Yates and Finney (1942, J) with the permission of the 
Editors of the Annals of Applied Biology. 


7.21 Sub-sampling for the estimation of error 


In an extensive survey the calculation of the sampling error from the whole 
of the data would be very laborious, and would provide estimates of error 
which are unnecessarily accurate. In order to cut down the work a sub-sample 
of the whole of the material may be taken, or estimates of error may be calculated 
for certain parts of the survey only, e.g. certain strata, with or without sub- 
sampling. 

A convenient method of sub-sampling, which is applicable if there are a 
large number of separate strata of approximately equal size and a pooled 
estimate of the error variance per unit is required, is to select a random pair 
of units from each stratum, and to take the differences between the two 
members of each pair. In this case each difference d contributes one
degree of freedom. If there are t differences the estimate of $\sigma^2$ is
given by $S(d^2)/2t$.
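
The paired-difference estimate can be sketched in a few lines of Python. This is only an illustration: the function name `pooled_variance_from_pairs` and the simulated strata (200 strata of 30 units, common variance 4) are assumptions made for the example, not data from any survey described here.

```python
import random

def pooled_variance_from_pairs(pairs):
    """Estimate the error variance per unit as S(d^2) / 2t from t pairs."""
    t = len(pairs)
    if t == 0:
        raise ValueError("at least one pair is needed")
    return sum((a - b) ** 2 for a, b in pairs) / (2 * t)

# Hypothetical material: 200 strata with different means, common variance 4.0.
rng = random.Random(1)
strata = [[rng.gauss(10 * k, 2.0) for _ in range(30)] for k in range(200)]

# One random pair of units from each stratum gives one degree of freedom.
pairs = [tuple(rng.sample(units, 2)) for units in strata]
estimate = pooled_variance_from_pairs(pairs)
print(f"estimate of variance per unit from {len(pairs)} pairs: {estimate:.2f}")
```

Because each difference is taken within a stratum, the differences between stratum means do not enter the estimate.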

If the strata are few in number and of unequal size this method is not 
applicable, since the number of differences would be inadequate and the 
different strata would not be represented in proportion to their size. In general 
it is important to see that the contributions to the error variance from the 
different parts of the population are substantially the same in the sub-sample 
as they would be if the whole of the data of the original sample were used. 
For this reason the sub-sample should in general be obtained by the use of a 
uniform sampling fraction over the whole of the original sample. A systematic 
method of selection will usually be satisfactory. 

The taking of a sub-sample in this manner is somewhat troublesome, and 
also prevents accurate comparisons of the errors of parts of the survey which 
are in themselves small and therefore inadequately represented in the sub- 
sample. For these reasons the more convenient method of calculating the 
sampling error for certain parts of the population only is often employed. 
This procedure will lead to inaccuracies if the variability of the omitted portions 
is different from those that are included, but these inaccuracies can be reduced 
by selecting the parts to be included on a proper random basis. Thus in the 
1942 Census of Woodlands the sampling error was calculated by selecting 
two counties at random from each of the seven regions, the data of the first 
5 per cent. sample only being used. The surveyed quarter-sheets within each 
of these counties, which were selected on a systematic grid pattern, were treated 
as if they were a random sample from all the sheets of the county. 

With grouped data the calculation of the sampling error from the whole 
of the data may well not present any appreciably greater labour than the use 
of a sub-sample, and in such cases the whole of the data will naturally be used. 


7.22 Rounding-off and grouping errors 


If a constant grouping interval over the whole of the range is adopted,
the additional variance per unit introduced by the grouping is 1/12 of the
square of the grouping interval. If the true variance per unit is $\sigma^2$
and the grouping interval is a, the total variance per unit of the grouped
data is consequently $\sigma^2 + a^2/12$, and the fractional loss of accuracy
due to grouping is $a^2/(12\sigma^2)$.

Rounding-off is a form of grouping. In terms of units of the last place
that is retained, the variance per unit due to rounding-off is equal to 1/12.

If the variance is estimated from rounded-off or grouped data the additional
variance due to grouping will be included in the estimate. If for any purpose
an estimate of the true variance is required a deduction of 1/12 of the square
of the grouping interval must be made. This is known as Sheppard's correction.

Grouping will also result in some loss of accuracy in the estimation of the
variance. In a sample from a normal distribution the fractional loss of
information due to this cause is $a^2/(6\sigma^2)$.
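
The rule that grouping adds 1/12 of the square of the interval to the variance can be checked by a small simulation. The sketch below is illustrative only: the normal distribution, the values of `sigma` and `a`, and the helper names `group` and `variance` are all assumptions chosen for the example.

```python
import random

def group(value, interval):
    """Replace a value by the mid-point of the grouping interval containing it."""
    return (value // interval + 0.5) * interval

def variance(values):
    """Mean squared deviation from the mean (divisor n)."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

rng = random.Random(7)          # fixed seed, so the run is repeatable
sigma, a = 1.0, 0.5             # hypothetical standard deviation and interval
values = [rng.gauss(0.0, sigma) for _ in range(100_000)]
grouped = [group(v, a) for v in values]

var_true = variance(values)
var_grouped = variance(grouped)
added = var_grouped - var_true          # should lie near a**2 / 12
corrected = var_grouped - a ** 2 / 12   # Sheppard's correction

print(f"a^2/12 = {a ** 2 / 12:.4f}, added variance = {added:.4f}")
print(f"ungrouped = {var_true:.4f}, grouped = {var_grouped:.4f}, "
      f"corrected = {corrected:.4f}")
```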

The above formulae are approximate, but provided the distribution is
reasonably symmetrical they can be used in all cases ordinarily occurring in 
practice, in which a is not likely to be greater than, and is usually considerably 
less than, o. 

If the distribution is markedly skew, however, grouping may introduce
a bias in the estimates which is not included in the above variance formulae.
An extreme case is provided by distributions in which there are a large number 
of small values and only few large values, such as the distribution of acres 
of crops and grass, or of wheat, in the farms of Table 6.6.a. In this type of 
distribution, if grouping is used, a smaller grouping interval must be employed 
for the small values than for the large values, as in Example 7.2.b. If only 
comparisons between similar estimates are required a coarser grouping may be 
adopted than if absolute values of the estimates are required, since in the 
former case any bias introduced will affect all estimates similarly. 


Example 7.22 


Determine the loss of accuracy in the estimation of the mean due to 
grouping in the data in Example 7.1.b. 


The grouping interval is 300, and the standard deviation per unit (including
grouping errors) is 526. Their ratio is 0.571, and the fractional loss of
accuracy is therefore $\frac{1}{12} \times 0.571^2$ or 2.7 per cent., i.e.
equivalent to 4 of the 162 families in the sample.
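
The arithmetic of this example can be repeated as follows. The function name `fractional_loss_from_grouping` is an assumption made for the illustration; the figures 300, 526 and 162 are those quoted above.

```python
def fractional_loss_from_grouping(interval, sd):
    """Fractional loss of accuracy of the mean, a^2 / (12 sigma^2)."""
    return (interval / sd) ** 2 / 12

interval, sd, families = 300, 526, 162
ratio = interval / sd
loss = fractional_loss_from_grouping(interval, sd)
families_lost = loss * families   # loss expressed as an equivalent number of units

print(f"ratio = {ratio:.3f}, loss = {100 * loss:.1f} per cent, "
      f"equivalent to about {families_lost:.0f} families")
```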


7.23 Determination of errors due to bias 


As has been pointed out in Chapter 2, bias can arise either in the selection 
of the sample, or in the estimation process. 

Although biased methods of estimation can in general be avoided, occasions
arise, such as that discussed in Example 7.17, where biased estimates are
considerably more accurate than the corresponding unbiased estimates. The
biased estimates are also sometimes considerably simpler to calculate than the
unbiased estimates. For these reasons it is sometimes advisable to use biased
estimates for comparative purposes. Such estimates can clearly be used with
more confidence if the amount of bias that they introduce is not large.

In general it is possible to make an estimate of the expected magnitude of 
a bias in estimation by a combination of mathematical analysis and detailed 
numerical analysis of the sampling results. Such methods, however, are 
complicated and vary with the type of sampling adopted. We shall therefore 
not describe them here. 

An alternative and relatively simple method is to compare the biased 
estimates with the corresponding unbiased estimates of the same quantity. 
Any one comparison will of course be affected by random sampling errors in 
both the estimates, but if the material is sufficiently extensive to provide a 
number of comparisons, the mean difference will provide an estimate of the 
average bias whose accuracy can be judged by the variation of the individual 
differences. 

Bias in the selection of the sample can in general only be assessed by 
comparison with another sample known to be free from bias. If, however, 
the distribution of some supplementary variate is known, bias in selection 
can sometimes be assessed, and if necessary eliminated, at least in part, by 
the use of regression. The calibration of eye estimates by means of regression, 
described in Section 6.15, provides an example of this procedure. As already 
pointed out, the procedure is subject to qualifications and cannot be relied on 
to compensate for all possible sources of bias. The only certain guarantee
that bias is absent is the use of methods of selection and observation which
are free from bias.


Example 7.23 


Assess the evidence for bias in the estimate of the dressing of nitrogen 
per acre derived from the unweighted mean over all fields of Example 6.19. 


The results already given in Tables 5.19.b and 6.19.c show that in each 
size-group the unweighted mean dressing is less than the weighted mean. 
The apparent bias in the overall unweighted mean dressing is -0.059. A
larger number of comparisons of the same type may be obtained by dividing 
the 22 farms of the small-size group into two groups of 11 farms each, and 
the 36 farms of the medium-size group into four groups of 9 farms each. 
Division of Table 6.19.a into blocks in this manner gives the comparisons 
shown in Table 7.23. 

Six out of the seven differences are negative. The mean difference is
-0.040, and the standard error of this difference, estimated from the sum
of the squares of the deviations of the individual differences from their mean,
is ± 0.022. The evidence for bias on this small amount of data is therefore
not conclusive.
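
The mean difference and its standard error can be recomputed from the seven differences of Table 7.23. The sketch below is a plain check of the arithmetic; the variable names are illustrative, and the ratio of the mean to its standard error is added only as a convenient summary, not as part of the procedure described above.

```python
import math

# Differences (unweighted mean minus weighted mean) from Table 7.23.
differences = [-0.037, -0.016, -0.047, 0.029, -0.033, -0.017, -0.160]

n = len(differences)
mean_diff = sum(differences) / n
# Sum of squares of deviations of the individual differences from their mean.
ss = sum((d - mean_diff) ** 2 for d in differences)
se_mean = math.sqrt(ss / (n - 1) / n)
ratio = mean_diff / se_mean

print(f"mean difference = {mean_diff:.3f}, standard error = {se_mean:.3f}")
print(f"ratio of mean to standard error = {ratio:.2f} on {n - 1} degrees of freedom")
```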

The procedure has here been adapted to the data given in Table 6.19.a.
If the sampling were systematic the division of the groups should be made
by a random or systematic process such that each of the sub-groups constitutes
a substantially random sample of the whole of the group. The procedure is
also subject to the qualification that any bias arising from differential weighting
of the different size-groups will be excluded by this method of estimation,
since the differences are based on comparisons within size-groups.


TABLE 7.23. ESTIMATION OF BIAS

                  Weighted    Unweighted    Difference
                    mean         mean

Small farms         .521         .484         -.037
                    .378         .362         -.016

Medium farms        .584         .537         -.047
                    .417         .446         +.029
                    .503         .470         -.033
                    .378         .361         -.017

Large farms         .576         .416         -.160


An alternative method of subdivision which overcomes this limitation is
to form sub-samples in which all the size-groups are represented in the correct
proportions. If the data from a number of counties are available, no subdivision
will be necessary, since the differences between size-group means and between
the overall means for the different counties will provide all necessary
comparisons.


7.24 Interpenetrating samples : comparison of observers 


The error variance of the difference between two observers, estimated from 
interpenetrating samples, can be obtained by calculating the error variance 
appropriate to each observer, and adding these variances. This procedure, 
however, is subject to certain qualifications. In the first place the correction 
for finite sampling must not be applied. In the second place only those 
components of variance must be included which affect the comparisons between 
the observers. Thus, if a two-stage sampling process is adopted and each of 
the selected primary units is sampled by both observers, only the second-stage 
sampling error will enter into the comparison between observers. 

If the data relevant to the comparison between two observers are at all 
extensive it is often possible to make a direct estimate of the error of this 
comparison by subdividing the material so that a number of independent 
differences are obtained, in the same manner as in the estimation of bias from 
different methods of estimation. Thus, with the above two-stage sampling
process the difference between the observers might be obtained for each
primary unit separately.


7.25 Estimation of the sampling error from duplicate samples


If a survey is carried out in two or more interpenetrating parts and the 
results are tabulated separately, an estimate of the sampling error can be 
obtained from the differences of the two samples. For such an estimate to be 
of any value there must be a number of independent differences, so that at 
least a moderate number of degrees of freedom are available. Even with 
extensive surveys the number of available differences is likely to be small, 
so that such estimates are usually rather rough. Nevertheless they are useful 
when the detailed results are not available. 


If the two samples are distinguished by single and double dashes, and the
estimate of the population total is given by the sum of the t parts 1, 2, etc.,
we have $Y' = Y_1' + Y_2' + \ldots$, and $Y'' = Y_1'' + Y_2'' + \ldots$ If the
sizes of the two samples are in the ratio $\lambda : \mu$, where
$\lambda + \mu = 1$, the estimate Y of the population total from the two
samples is $\lambda Y' + \mu Y''$, with similar expressions for $Y_1$, etc. An
unbiased estimate of the error of Y is given by

$$V(Y) = \lambda\mu(1 - f)\{(Y_1' - Y_1'')^2 + (Y_2' - Y_2'')^2 + \ldots\}$$

where f is the sampling fraction for the whole survey. When the parts vary
considerably in size this estimate is very inefficient, since excessive weight is
given to the larger totals. If the approximate relation between the variances
of the totals of the parts is known, a more efficient estimate can be obtained,
though this will be biased if the assumed law of variance is incorrect. If the
variances are proportional to $Y_1$, $Y_2$, etc., the efficient estimate is

$$V(Y) = \frac{Y\lambda\mu(1 - f)}{t}\left\{\frac{(Y_1' - Y_1'')^2}{Y_1} + \frac{(Y_2' - Y_2'')^2}{Y_2} + \ldots\right\}$$

This law of variance is likely to be approximately true for area surveys if the
density per unit area of the quantity surveyed does not vary very greatly from
part to part.

If the variances are proportional to some other quantity, such as the number
of units in each part, these numbers must be substituted for $Y_1$, $Y_2$, etc. and
their sum for Y in the above formula.


Example 7.25 


In the 1942 Census of Woodlands the total volumes of timber for the seven 
regions of the survey obtained from the first and second 5 per cent. samples, 
excluding areas surveyed in 1938-9, and with allowance for felling in the 
interval between the two samples, are shown in Table 7.25. Estimate the
sampling error to which the combined estimate of the total volume of timber
for the country is subject.

TABLE 7.25. VOLUMES OF TIMBER IN THE DIFFERENT REGIONS ESTIMATED FROM
THE FIRST AND SECOND 5 PER CENT. SAMPLES OF THE 1942 CENSUS OF WOODLANDS

Volume, m. cu. ft.

Region   Sample A   Sample B   Mean   A - B   (A - B)²/Mean

A           253        201      227    +52        11.9
B            74         84       79    -10         1.3
C           100        107      104     -7         0.5
D           148        164      156    -16         1.6
E           209        227      218    -18         1.5
F            94         78       86    +16         3.0
G           112        119      116     -7         0.4

Total       990        980      986    +10        20.2


The computations are shown in the last three columns. The sum of
squares of the differences A - B is 3738, and therefore from the first formula
the standard error of the total is

$$\sqrt{\tfrac{1}{2} \times \tfrac{1}{2} \times \tfrac{9}{10} \times 3738} = \pm 29.0 \text{ m. cu. ft.}$$

The sum of the last column is 20.2, and therefore by the second formula the
standard error is

$$\sqrt{986 \times \tfrac{1}{2} \times \tfrac{1}{2} \times \tfrac{9}{10} \times \tfrac{1}{7} \times 20.2} = \pm 25.3 \text{ m. cu. ft.}$$

It will be noted that an estimate of the standard error of any regional total
can be obtained directly from the second formula by substituting this total
for the grand total 986.

The above estimates are very rough, since they are based on only 7 degrees
of freedom. The estimate obtained by the method described in Section 7.18
is ± 19.7 m. cu. ft. With a more extensive set of comparisons between the
two samples a more accurate estimate could be obtained.
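
Both standard errors of this example can be recomputed from the figures of Table 7.25. The sketch below takes the two samples as equal in size (so that both weights are 1/2) and the overall sampling fraction as 1/10, as in the worked arithmetic; the variable names are illustrative only.

```python
import math

# Regional volumes (m. cu. ft.) from Table 7.25, regions A to G.
sample_a = [253, 74, 100, 148, 209, 94, 112]
sample_b = [201, 84, 107, 164, 227, 78, 119]
means = [227, 79, 104, 156, 218, 86, 116]   # rounded means as tabulated

lam = mu = 0.5      # the two 5 per cent. samples are of equal size
f = 0.10            # sampling fraction of the two samples combined
t = len(means)      # number of parts (regions)

squares = [(a - b) ** 2 for a, b in zip(sample_a, sample_b)]

# First formula: unweighted sum of squared differences.
se_unweighted = math.sqrt(lam * mu * (1 - f) * sum(squares))

# Second formula: variances taken as proportional to the regional totals.
total = sum(means)
weighted_sum = sum(d2 / m for d2, m in zip(squares, means))
se_weighted = math.sqrt(total * lam * mu * (1 - f) / t * weighted_sum)

print(f"sum of squares of differences = {sum(squares)}")
print(f"first formula:  +/- {se_unweighted:.1f} m. cu. ft.")
print(f"second formula: +/- {se_weighted:.1f} m. cu. ft.")
```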


7.26 Presentation of sampling errors in extensive surveys 


In the preceding sections the methods of estimating the sampling error
of single estimates, e.g. of the population mean, have been described. In
extensive surveys the results will usually be broken down in various ways,
so that the final tables will contain a large number of estimates relevant to
different parts of the population. The errors of these different estimates could 
be calculated separately by the methods already given, using pooled estimates 
of the error variance per unit where appropriate, but such a procedure would 
be laborious, and even if it were carried out, the presentation of the separate 
standard errors in the tables of results would make these tables somewhat 
confused. 

From every point of view, therefore, it is desirable to have some condensed 
method of presenting the errors of the component parts of elaborate tables. 
This can be effected by making use of some error law, either theoretical or 
empirical, which will enable the standard error of any particular component to 
be rapidly obtained when required from other information available in the 
table. The exact form of the law will depend on the type of material and the 
nature of the information tabulated. 

The simplest case is that in which the tables contain means of a quantitative 
variate derived from a survey with constant sampling fraction, and the error 
variance per sampling unit is constant. The standard error of any mean then 
depends only on the value of the sampling fraction, the error variance, and the 
number of units on which the mean is based. These numbers are likely in 
any case to be of interest in themselves and to be presented either directly or 
in raised form as estimates of the numbers of units in the different parts of the 
population. Alternatively, if the actual numbers of units in the different parts 
are known these may be presented. 

Whatever the exact form of presentation an auxiliary table can easily be 
prepared by the use of formula 7.2, or its analogue for stratified sampling, 
giving the standard errors corresponding to different numbers in the population 
or sample. If a table is felt to be unduly elaborate the formula on which it is 
based may be presented. If the standard errors are likely to be used mainly 
for testing the differences between different means the correction for finite 
sampling can be omitted, with of course a note to this effect. It may be noted
that with a table of this type the standard error of the difference of two means
based on numbers $n_1$ and $n_2$ can be obtained by entering the table with the
number $n'$, given by $1/n' = 1/n_1 + 1/n_2$.
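
An auxiliary table of this kind, and the rule for entering it with $n'$, might be sketched as follows. The sketch assumes that the variance of a mean has the form $s^2(1 - f)/n$, as quoted for a random sample in Section 8.3; the value of `s2`, the list of sample numbers and the function names are hypothetical.

```python
import math

def se_of_mean(s2, n, f=0.0):
    """Standard error of a mean of n units; f = 0 omits the finite correction."""
    return math.sqrt(s2 * (1 - f) / n)

def effective_number(n1, n2):
    """Number n' with which to enter the table for a difference of two means."""
    return 1 / (1 / n1 + 1 / n2)

s2 = 25.0   # hypothetical error variance per sampling unit

# Auxiliary table: standard errors for a range of sample numbers.
table = {n: se_of_mean(s2, n) for n in (25, 50, 100, 200, 400)}
for n, se in table.items():
    print(f"n = {n:4d}   S.E. = {se:.3f}")

# Difference of two means based on 100 and 100 units: enter with n' = 50.
n_eff = effective_number(100, 100)
se_diff = se_of_mean(s2, n_eff)
print(f"n' = {n_eff:.0f}, S.E. of difference = {se_diff:.3f}")
```

The finite sampling correction is left out of the difference, in line with the remark above that it can be omitted when the table is used mainly for testing differences.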

Another simple case is that in which the summary table relates to qualitative 
data based on the proportion of units possessing a given attribute. If the 
sample is random the standard error of any entry depends only on the sampling 
fraction, the proportion of units possessing the given attribute in the part of
the population under consideration, and the number of units. If the variation
in the proportion of units is not large in the different parts of the population,
a table or formula based on the proportion in the whole population may be
sufficiently accurate. If the proportion of units possessing the attribute is 
small in all parts, q can be taken equal to unity. 

An example of the use of an empirical law is provided by the case in which
the error variances are approximately proportional to the magnitude of the
totals of the different parts, which, as pointed out in the last section, is likely
to hold for certain types of area survey. In this case a formula or table relating
the errors to the entries of a table of totals can be presented.

There is not space here to discuss more complicated cases, which must 
be dealt with on their merits as they arise. With the more elaborate types 
of sampling the possibilities for presenting the standard errors in the form 
of auxiliary tables are more limited, but even in such cases it is often possible 
to summarize the standard errors in the form of a few relatively simple formulae,
suitable for rapid calculation on a slide rule.


EFFICIENCY


8.1 General remarks 


The methods described in Chapter 7 enable the sampling error associated 
with a sample of a given type and size to be calculated from the data furnished 
by the sample itself. When planning a sample census or survey, we have 
to solve the more general problem of calculating the sampling errors of samples 
of various types and sizes from the data furnished by a sample of a particular 
type and size. We can then determine which method of sampling is likely 
to be most efficient and the size of sample necessary to give the required 
accuracy. 

The determination of the sample size in the case of a random sample from 
a large population has already been discussed in Section 4.31. It was there 
shown that, for qualitative characters which are attributes of the sampling 
units, the number of units required could be determined without any prior 
knowledge of the material other than the approximate proportion of units
possessing the given attribute in the population; and that for quantitative
characters knowledge of the standard deviation of the character in question 
per sampling unit was all that was required. 

The formulae of Section 4.31 apply when the population is large
relative to the size of sample required. If the population is not large a
correction must be made to allow for finite sampling. This is most simply
done by calculating the number of units $n_0$ that would be required if the
population were large, and the corresponding sampling fraction $f_0 = n_0/N$.
The required sampling fraction is then given by

$$f = \frac{f_0}{1 + f_0} \qquad (8.1)$$

In this calculation $f_0$ may be greater than unity.
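
Formula 8.1 can be sketched as a small function. The check at the end rests on the assumption that the sampling variance is proportional to $(1 - f)/n$, so that the corrected sample from the finite population should have the same variance as $n_0$ units from a large one. The function name and the figures used are illustrative.

```python
def required_fraction(n0, population_size):
    """Sampling fraction f = f0 / (1 + f0), with f0 = n0 / N (formula 8.1)."""
    f0 = n0 / population_size
    return f0 / (1 + f0)

# Illustrative figures: 526 units needed if the population were large,
# drawn in fact from a population of 2496 units.
n0, N = 526, 2496
f = required_fraction(n0, N)
n = f * N
print(f"f = {f:.3f}, n = {n:.0f}")

# (1 - f) / n for the finite population equals 1 / n0 for a large one.
print(f"(1 - f)/n = {(1 - f) / n:.6f}, 1/n0 = {1 / n0:.6f}")

# f0 may exceed unity; the required fraction still remains below 1.
print(f"n0 = 5000 gives f = {required_fraction(5000, N):.3f}")
```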

The method followed in Section 4.31, i.e. that of taking the appropriate 
formula for the standard error of a sample of size n and rewriting this 
formula to give an equation for n, is a general one and can be applied 
to the more complicated types of sample, using the appropriate formulae
for the standard errors given in Chapter 7. It is apparent, however, that
these formulae can only be used if the relevant variances per sampling
unit are known or can be estimated. In certain cases, also, the formulae
cannot conveniently be rearranged so as to give n directly. This, however,
is a minor point, since the required solution can always be quickly found by 
trial once estimates of the relevant variances are available. 

In the following sections we will discuss the problems that arise in the
estimation of the variances relevant to different types of sample when the
basic data consist of a sample of a different type. In certain cases data relating
to all the units of a population will be available. This situation does not differ
in any essential particulars from that in which data are derived from a random
sample of the population.

We will here define the sense in which we shall use various terms in the 
subsequent discussion. 

The relative accuracy of two samples which differ in respect of method 
of sampling or size of sample, or both, may be defined as the reciprocal of the 
ratio of the sampling variances of the estimates provided by them. 

The relative precision of two different methods of sampling based on the 
same type of sampling unit may be defined as the reciprocal of the ratio of the 
sampling variances of the estimates given by the two methods when the same 
number of units are taken. 

The relative efficiency of two different methods of sampling based on the 
same type of sampling unit may be defined as the reciprocal of the ratio of the 
numbers of units required to attain a given accuracy with the two methods. 

In the case of a random sample from a large population, or a stratified 
sample with fixed strata from such a population, the relative efficiency is equal 
to the relative precision. But if the size of the strata depends on the number 
of units in the sample, or if the population is not large relative to the size of the 
sample, there is a difference between the two concepts. 

The term efficiency is already in current use in the theory of estimation. 
It is there used in an absolute sense. An estimate is efficient (i.e. has an efficiency 
of 100 per cent.) if in large samples it is one of the class of most accurate 
estimates, i.e. estimates with minimum variance. An estimate has an efficiency 
of x per cent. if it has 100/x times this minimum variance. This use of the
term is analogous to precision in our terminology. The reason why no distinction 
has to be made between precision and efficiency in the theory of estimation is 
that only large populations are normally under consideration, in which case 
the two concepts are synonymous. Since no confusion is likely to arise, we 
shall continue to use the term efficiency when discussing the relative accuracy 
of different estimates derived from the same sample. 

The concepts of relative precision and relative efficiency may be extended 
to cover methods of sampling based on different types of sampling unit, by 
replacing numbers of units by the amount of material included in the sample. 
They may be further extended to cover the relative accuracy for a given cost 
and the relative cost for a given accuracy. 

It may be noted here that the relative precision and relative efficiency of 
different types of sampling should as far as possible be judged from estimates 
of the sampling variances derived from the same set of data. Comparisons 
based on estimates derived from independent samples of different types are 
subject to errors of estimation which are considerably larger, and comparisons 
based on samples from different aggregates of similar material are even more 
subject to uncertainty. No very general conclusions should, however, be 
drawn from a single comparison based on a small amount of data, even when 
a single set of data is used. The relative precision of stratified and random
samples, for instance, will depend on the differences between strata, and these
differences may vary considerably even in apparently similar material. 


8.2 Qualitative data 


If the variates under consideration are attributes of the sampling units, the 
effect of stratification, with either uniform or variable sampling fraction, can 
be determined from a knowledge of the proportions of units possessing the 
given attribute in the different strata. In other cases qualitative variates must be 
treated similarly to quantitative variates, as in the estimation of sampling errors. 

Formulae for the required size of a stratified random sample with uniform
sampling fraction, analogous to those for a random sample given in Section 4.31,
can be written down without difficulty. A somewhat simpler approach, however,
is to estimate the percentage standard error of a stratified sample of any
convenient size (e.g. the size of the sample of which the data are available)
on the assumption that the population is large. The size of sample required
to give any predetermined percentage standard error is then given, if the
population is large, by the formula

$$\frac{\text{Size of sample required}}{\text{Size of actual sample}} = \frac{(\text{Actual percentage standard error})^2}{(\text{Required percentage standard error})^2}$$

Allowance for the effect of finite population size can then be made by
formula 8.1.

In the case of a stratified random sample with variable sampling fraction 
the same procedure can be followed, with the exception that allowance for the 
effect of finite population size cannot be made in the above manner. If, 
therefore, any of the correction factors $(1 - f_i)$ are sufficiently large to be of
importance, the approximate size of sample required may first be calculated 
as above and the final size found by trial. Variable sampling fractions, however, 
are not likely to be much used for qualitative data. 


Example 8.2.a 


If a large population of individuals is divided into five strata containing
equal numbers of people, determine the relative sizes of a stratified and a fully
random sample of the same accuracy when the percentages of individuals
giving a positive answer to a given question in the different strata are (a) 70,
60, 50, 40 and 30 per cent., (b) 10, 7½, 5, 2½ and 0 per cent.


A sample of 500 people will have 100 in each stratum. The variance of
the number in the sample giving positive answers will be, in case (a), for a
stratified sample,

$$100 \times 0.7 \times 0.3 + 100 \times 0.6 \times 0.4 + 100 \times 0.5 \times 0.5 + 100 \times 0.4 \times 0.6 + 100 \times 0.3 \times 0.7 = 115$$

and for a random sample,

$$500 \times 0.5 \times 0.5 = 125$$

The ratio of the required sizes is therefore 125/115 = 1.087, i.e. the random
sample will have to be 8.7 per cent. larger. In case (b) a similar calculation
shows the random sample will have to be 2.7 per cent. larger.
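
Both cases of this example can be verified with a short calculation. The function names are illustrative, and the sketch assumes, as in the example, a large population with strata of equal size.

```python
def stratified_variance(n_per_stratum, proportions):
    """Variance of the number of positive answers in a stratified sample."""
    return sum(n_per_stratum * p * (1 - p) for p in proportions)

def random_variance(n, proportions):
    """Variance for a fully random sample of n from equal-sized strata."""
    p_bar = sum(proportions) / len(proportions)
    return n * p_bar * (1 - p_bar)

case_a = [0.70, 0.60, 0.50, 0.40, 0.30]
case_b = [0.10, 0.075, 0.05, 0.025, 0.0]

for label, props in (("a", case_a), ("b", case_b)):
    strat = stratified_variance(100, props)
    rand = random_variance(500, props)
    print(f"case ({label}): stratified = {strat:.3f}, random = {rand:.3f}, "
          f"ratio = {rand / strat:.3f}")

ratio_a = random_variance(500, case_a) / stratified_variance(100, case_a)
ratio_b = random_variance(500, case_b) / stratified_variance(100, case_b)
```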


Example 8.2.b

Determine from the data of Examples 6.5 and 7.6.a the numbers of
farms required to give a sampling standard error of 5 per cent. in the estimate
of the number of farms growing wheat (a) when the sample is random, (b) when
it is stratified by size-groups.


(a) We have p = 54/125 = 0.432. Consequently, from Sections 4.31
and 8.1

$$n_0 = \frac{0.568}{0.432 \times 0.05^2} = 526$$
$$f_0 = 526/2496 = 0.210$$
$$f = \frac{0.210}{1 + 0.210} = 0.174$$
$$n = 433$$


(b) We have already found that for the stratified sample U = 1080 and
S.E. (U) = ± 71.64. If the population had been large, therefore, we should
have had S.E. (U) = $71.64/\sqrt{1 - 1/20}$ = 73.5. Consequently in this case
S.E. % (U) = 6.80. Hence

$$\frac{n_0}{125} = \frac{6.80^2}{5^2}$$
$$n_0 = 231$$
$$f_0 = 231/2496 = 0.0927$$
$$f = 0.0927/(1 + 0.0927) = 0.0848$$
$$n = 212$$


The standard error of the total estimated from a random sample of
125 farms is $20\sqrt{125 \times 0.432 \times 0.568 \times 19/20}$ = 108.0. Consequently the
relative precision of the stratified and random samples of 125 units (or indeed
of any number of units) is given by $108.0^2/71.64^2$ = 2.27. The relative
efficiency, when a 5 per cent. standard error is required, however, is
433/212 = 2.05. The relative efficiency is slightly less than the relative
precision because we are sampling from a finite population.
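
The steps of this example can be followed through without intermediate rounding. In the sketch below the helper `sample_size_finite` and the variable names are assumptions made for the illustration; the input figures (54 wheat-growing farms out of 125, N = 2496, U = 1080, S.E. 71.64, raising factor 20) are those quoted in the example. Carried through unrounded, case (a) gives about 434 farms rather than the 433 obtained with the rounded value of the sampling fraction.

```python
import math

N = 2496            # farms in the population
n_sample = 125      # farms in the sample actually taken
target = 5.0        # required percentage standard error

def sample_size_finite(n0, population_size):
    """Apply formula 8.1: f = f0 / (1 + f0), then n = f * N."""
    f0 = n0 / population_size
    return population_size * f0 / (1 + f0)

# (a) Random sample: n0 = q / (p * c^2), with c the required fractional S.E.
p = 54 / 125
n0_random = (1 - p) / (p * (target / 100) ** 2)
n_random = sample_size_finite(n0_random, N)

# (b) Stratified sample: remove the finite correction, then scale the size.
U, se_U = 1080, 71.64
se_large = se_U / math.sqrt(1 - 1 / 20)
se_pct = 100 * se_large / U
n0_strat = n_sample * (se_pct / target) ** 2
n_strat = sample_size_finite(n0_strat, N)

# Relative precision (same number of units) and relative efficiency.
se_random = 20 * math.sqrt(n_sample * p * (1 - p) * 19 / 20)
rel_precision = (se_random / se_U) ** 2
rel_efficiency = n_random / n_strat

print(f"(a) n0 = {n0_random:.0f}, n = {n_random:.0f}")
print(f"(b) S.E. (large population) = {se_large:.1f}, n0 = {n0_strat:.0f}, "
      f"n = {n_strat:.0f}")
print(f"relative precision = {rel_precision:.2f}, "
      f"relative efficiency = {rel_efficiency:.2f}")
```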


8.3 Random sample and stratified sample with uniform sampling 
fraction 


The general principle to be followed is to construct an analysis of variance
which corresponds as closely as possible to that appropriate to the required
type of sample. The procedure varies somewhat according to the type of data
available.


(a) From the data of a stratified sample with uniform sampling fraction : 


An analysis of variance within and between strata in the form of Table
7.7.a must be made. The within-strata mean square $s_1^2$ gives an estimate
of the error variance per unit in a stratified sample, and the mean square $s^2$
from the total line gives a similar estimate for a random sample. If separate
estimates of the error variance per unit have been made for the different strata,
as in Example 7.6.a, a pooled within-strata sum of squares may be calculated
by multiplying the within-strata error variances by the degrees of freedom
$n_i - 1$ for each stratum, and summing the products, or by summing the sums
of squares directly.

The formula of Section 4.31 can then be used to determine the size of
sample, using $s_1^2$ in place of $s^2$ for a stratified sample, and correcting for finite
population in the same manner as in Section 8.1. Since for a stratified sample
$V(\bar{y}) = s_1^2 (1 - f)/n$, and for a random sample $V(\bar{y}) = s^2 (1 - f)/n$, the
relative precision of stratified and random sampling will be given by the ratio
$s^2/s_1^2$. The relative efficiency will be somewhat less than the relative precision
when the corrections for finite sampling are appreciable.


This procedure is approximate in two respects. In the first place, if the
variances within the different strata are unequal they do not enter into the
mean square B with quite the correct weights, as already explained in Section 7.7.
In the second place, a stratified sample has a slightly greater overall variance
per unit than a random sample from the same population, and consequently
C is not the best estimate of the variance per unit of a random sample. Neither
of these approximations gives rise to errors of any importance in the comparison
of a random and a stratified sample, but it may be noted that the bias in C
can be almost completely eliminated by calculating $s^2$ from the formula

$$s^2 = \{(n - 1) C + B\}/n \qquad (8.3)$$

An extension of this formula is of use in the case of multiple stratification
(Section 8.4). Method (c) below takes account of both sources of disturbance.
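
Method (a) can be sketched as follows, taking B as the within-strata mean square and C as the mean square from the total line, as in the discussion above. The data are hypothetical (three strata of four units each) and the function name is an assumption made for the illustration.

```python
def variance_components(strata):
    """Return (B, C, s2) for a stratified sample with uniform sampling fraction.

    B  : within-strata mean square (error variance per unit, stratified sample)
    C  : mean square from the total line
    s2 : C adjusted by formula 8.3, {(n - 1) C + B} / n
    """
    n = sum(len(s) for s in strata)
    grand_mean = sum(sum(s) for s in strata) / n
    within_ss = 0.0
    for s in strata:
        mean = sum(s) / len(s)
        within_ss += sum((y - mean) ** 2 for y in s)
    total_ss = sum((y - grand_mean) ** 2 for s in strata for y in s)
    B = within_ss / (n - len(strata))
    C = total_ss / (n - 1)
    s2 = ((n - 1) * C + B) / n
    return B, C, s2

# Hypothetical observations, one inner list per stratum.
strata = [[12, 15, 11, 14], [22, 25, 21, 24], [33, 30, 35, 34]]
B, C, s2 = variance_components(strata)
print(f"B = {B:.3f}, C = {C:.3f}, s2 from formula 8.3 = {s2:.3f}")
print(f"relative precision of stratified to random sampling = {s2 / B:.1f}")
```

With strata as sharply separated as in these made-up figures the adjusted value lies noticeably below C; in most practical material the adjustment is small.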


(b) From the data of a random sample : 


An analysis of variance within and between strata can be made in the same 
manner as with a stratified sample with uniform sampling fraction, and $s_1^2$ 
and $s^2$ can be estimated as in a stratified sample. 

For this procedure it is only necessary that the units of the sample be 
classified by strata. The numbers of units of the whole population falling in 
the different strata do not require to be known. 

If these numbers are known, Method (c) below can be followed. This 
will give slightly more accurate results at the cost of a little additional 
computation, since allowance is made for the fact that the numbers in the 
different strata in the sample will not be exactly proportional to the numbers 
in the population, owing to the fluctuations of random sampling. 


(c) From the data of a stratified sample with a variable sampling fraction 
(or any arbitrary values of the sampling fractions) : 

Estimates of the average within-strata mean square $s_1^2$ and of the overall 
mean square $s^2$ must be calculated from the proportions $h_i = N_i/N$ of the units 
of the population in the different strata. The formulae are 


$$s_1^2 = \sum h_i s_i^2$$

$$s^2 = \sum h_i s_i^2 + \sum h_i \bar{y}_i^2 - \bar{y}^2 - \sum h_i (1 - h_i)\, s_i^2/n_i$$


where $\bar{y}$ is the estimate of the population mean derived from the sample, and 
is consequently equal to $\sum h_i \bar{y}_i$. The relation of these formulae to the analysis 
of variance of a stratified sample with uniform sampling fraction will be 
apparent. The terms involving $\bar{y}$ in $s^2$ correspond to the between-strata 
component of variance, the last term of $s^2$ being the correction required because 
the $\bar{y}_i$ are themselves subject to sampling error. This correction will be trivial 
except when the between-strata component of variance is small and there are 
a large number of strata with few units from each stratum. If the $h_i$ are put 
equal to $n_i/n$ (uniform sampling fraction), $s^2$ will be the same, to order $1/n$, 
as that given by the mean square C of Method (a), with the exception that in 
Method (a) $\sum h_i \bar{y}_i^2 - \bar{y}^2$ is multiplied by a factor $n/(n - 1)$. 

It will be noted that the data need not be derived from a sample in which 
the sampling fractions are chosen with the object of obtaining the most accurate 
possible estimates: any set of data in which the sampling is random within 
strata, and from which the proportions of the units in the different strata, 
the strata means and the within-strata variances can be determined with sufficient 
accuracy, will be adequate. 


Example 8.3.a 


Determine the error variances per unit and the relative precision of a stratified 
random sample with uniform sampling fraction and a fully random sample 
from the data on wheat acreages of the stratified random sample of Hertfordshire 
farms (Examples 6.5 and 7.6.a). 


The analysis of variance is given in Table 8.3.a. The within-strata sum 
of squares is obtained directly from Table 7.6.a by summing the column 
$S_i (y - \bar{y}_i)^2$. The between-strata sum of squares is obtained by summing 
the products of the columns of Table 6.5.b giving the totals and means, and 
deducting the product of the general total and the general mean. These 
means should be taken to two and three decimal places respectively. We 
thus have $s_1^2 = 349.2$ and $s^2 = 797.1$. The estimate of the relative precision 
is therefore $797.1/349.2 = 2.28$. 
TABLE 8.3.a—ANALYSIS OF VARIANCE OF THE STRATIFIED RANDOM SAMPLE OF 
HERTFORDSHIRE FARMS 


                          Degrees        Sum          Mean 
                          of freedom     of squares   square 

Between size-groups            5          57,278 
Within size-groups           119          41,558       349.2 
Whole sample                 124          98,836       797.1 
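
The arithmetic of Method (a) can be checked with a short script. The following is a minimal sketch in Python, using only the sums of squares and degrees of freedom of Table 8.3.a; the function name `variance_estimates` is an illustrative choice and not part of the text. It returns the mean squares B and C, the value of $s^2$ adjusted by formula 8.3, and the relative precision C/B.

```python
def variance_estimates(within_ss, within_df, total_ss, total_df):
    """Mean squares and relative precision from an analysis of variance.

    B is the within-strata mean square (estimate of s_1^2), C the
    whole-sample mean square (estimate of s^2 for a random sample).
    """
    b = within_ss / within_df            # within-strata mean square B
    c = total_ss / total_df              # whole-sample mean square C
    n = total_df + 1                     # number of units in the sample
    s_sq_adjusted = ((n - 1) * c + b) / n    # formula 8.3
    relative_precision = c / b           # s^2 / s_1^2
    return b, c, s_sq_adjusted, relative_precision


# Figures of Table 8.3.a (stratified sample of Hertfordshire farms).
s1_sq, s_sq, s_sq_adj, rel = variance_estimates(41558, 119, 98836, 124)
print(round(s1_sq, 1), round(s_sq, 1), round(s_sq_adj, 1), round(rel, 2))
```

The adjustment of formula 8.3 lowers the overall mean square only slightly here (from about 797 to about 793), which agrees with the remark that the bias in C is of little importance in this comparison.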


Example 8.3.b 


Make similar estimates to those of Example 8.3.a, using the data of the 
random sample (Examples 6.6, 7.2.b and 7.6.b). 


The analysis of variance is given in Table 8.3.b. If the $N_i$ are not 
known, we have $s_1^2 = 488.5$ and $s^2 = 1329.9$. The estimate of the relative 
precision is therefore 2.72. 


TABLE 8.3.b—ANALYSIS OF VARIANCE OF THE RANDOM SAMPLE OF HERTFORDSHIRE 
FARMS 


                          Degrees        Sum          Mean 
                          of freedom     of squares   square 

Between size-groups            5         106,775 
Within size-groups           119          58,129       488.5 
Whole sample                 124         164,904     1,329.9 

If the $N_i$ are known the calculations follow the same lines as those of 
Example 8.3.c below, and are left to the reader. In this case we find $s_1^2 = 436.5$ 
and $s^2 = 1189.2$, the estimate of the relative precision being again 2.72. 


Example 8.3.c 


Make similar estimates to those of Example 8.3.a, using the data of the 
sample with variable sampling fraction (Examples 6.7 and 7.7.a). 


Table 8.3.c shows the calculations. The $h_i$ are calculated from the numbers 
in the population. These are given in Table 6.6.b, except for the last two 
size-groups, which have the values 215 and 51 respectively. It will be noted 
that we are here considering a sample stratified for districts as well as size-groups. 

TABLE 8.3.c—CALCULATION OF THE AVERAGE WITHIN-STRATA AND OVERALL 
MEAN SQUARES FROM THE STRATIFIED SAMPLE WITH VARIABLE SAMPLING 
FRACTION OF HERTFORDSHIRE FARMS 


Size-group    $h_i$    $n_i$    $\bar{y}_i$    $\bar{y}_i^2$    $s_i^2$    $h_i(1-h_i)s_i^2/n_i$ 

  1-  5       .174       0       (0)         (0)       (0)       (0) 
  6- 20       .208       3        0           0         0         0 
 21- 50       .143       6        4.5         20        53.5      1.1 
 51-150       .208      26        8.2         67       159.2      1.0 
151-300       .160      40       29.1        847       564.2      1.9 
301-500       .086      43       76.6      5,868     1,703        3.1 
501-          .020      17      172.1     29,618     2,614        3.0 
              .999     135       17.03     1,249.3     329.8     10.1 

301-500       .811      43       76.6      5,868     1,703        6.1 
501-          .189      17      172.1     29,618     2,614       23.5 
             1.000      60       94.65    10,357     1,875       29.6 

301-          .106      60       94.6      8,959     3,243        5.1 
              .999     135       17.03     1,102.0     474.8      9.1 


The sums of the products of $h_i$ with $\bar{y}_i$, $\bar{y}_i^2$ and $s_i^2$ are shown at the foot 
of their respective columns. We therefore have, since $17.03^2 = 290.0$, 

$s_1^2 = 329.8$ 
$s^2 = 329.8 + 1249.3 - 290.0 - 10.1 = 1279.0$ 

Hence the relative precision is $1279.0/329.8 = 3.88$. It will be seen that the 
corrections in the last column are here trivial, and could well be omitted. 

Size-groups 301-500 and 501- can be combined in the manner shown in 
the second part of Table 8.3.c. We have, for these two size-groups combined, 

$s^2 = 1875 + 10,357 - 8959 - 30 = 3243$ 

We can now insert a fresh line in Table 8.3.c to replace the lines for the 
last two size-groups in the first part of the table. The previous computation 
is then repeated, giving 

$s_1^2 = 474.8$ 
$s^2 = 474.8 + 1102.0 - 290.0 - 9.1 = 1277.7$ 

Hence the relative precision is 2.69. 

The amalgamation of the two size-groups containing the largest farms 
has resulted in a considerable loss of precision, the relative precision being 
$329.8/474.8 = 0.69$. 
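
The computation of Method (c) lends itself to a short routine. The following is a sketch in Python under stated assumptions: the function name `method_c` and the tuple layout are illustrative, the inputs are the rounded entries of Table 8.3.c, and a stratum with no sampled units is taken to contribute nothing to the correction term. Because it works from the rounded tabular values, it reproduces the figures of the example only to within rounding.

```python
def method_c(strata):
    """Average within-strata and overall mean squares, Method (c).

    strata: list of (h, n, mean, variance) tuples, one per stratum, where
    h = N_i/N, n = units sampled, mean = stratum mean, variance = s_i^2.
    """
    s1_sq = sum(h * var for h, n, mean, var in strata)
    ybar = sum(h * mean for h, n, mean, var in strata)
    between = sum(h * mean ** 2 for h, n, mean, var in strata) - ybar ** 2
    # Correction because the stratum means are subject to sampling error;
    # strata with n = 0 are skipped (an assumption made for this sketch).
    correction = sum(h * (1 - h) * var / n
                     for h, n, mean, var in strata if n > 0)
    return s1_sq, s1_sq + between - correction


common = [
    (0.174, 0, 0.0, 0.0),      # 1-5
    (0.208, 3, 0.0, 0.0),      # 6-20
    (0.143, 6, 4.5, 53.5),     # 21-50
    (0.208, 26, 8.2, 159.2),   # 51-150
    (0.160, 40, 29.1, 564.2),  # 151-300
]
separate = common + [(0.086, 43, 76.6, 1703.0), (0.020, 17, 172.1, 2614.0)]
combined = common + [(0.106, 60, 94.65, 3243.0)]

s1_sep, s_sep = method_c(separate)
s1_com, s_com = method_c(combined)
print(round(s1_sep, 1), round(s_sep / s1_sep, 2))   # two largest groups kept apart
print(round(s1_com, 1), round(s_com / s1_com, 2))   # two largest groups amalgamated
print(round(s1_sep / s1_com, 2))                    # loss from amalgamation
```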
8.4 Multiple stratification 


The gain in precision due to sub-stratification of a sample which is already 
stratified into main strata can be estimated by methods similar to that of 
Section 8.3. An example has already been given in Example 8.3.c, where 
the gain in precision resulting from the subdivision of the size-group 301- 
into two groups, 301-500 and 501-, was determined. 

If the data are derived from a sample with uniform sampling fraction 
which is itself sub-stratified, the comparisons can be made directly between 
the relevant mean squares in the analysis of variance, as in Method (a). The 
structure of the analysis of variance in this case is 


Whole sample ($s^2$): 
    Between main strata 
    Within main strata ($s_1^2$): 
        Between sub-strata 
        Within sub-strata ($s_2^2$) 

The ratio of the mean squares $s_1^2$ and $s_2^2$ within main strata and within 
sub-strata will give the required relative precision. 

A similar analysis can be constructed from data derived from other types 
of sample with uniform sampling fraction (Method (b)). 

One case of practical importance is that in which both the main and 
sub-strata are arbitrary subdivisions of an area, all the main and all the 
sub-strata being of equal size. If there are $t$ main strata, and $t'$ sub-strata per 
main stratum, with $k$ selected sampling units per sub-stratum, the analysis 
of variance will be of the form shown in Table 8.4. 


TABLE 8.4—STRUCTURE OF THE ANALYSIS OF VARIANCE IN A 
DOUBLE STRATIFICATION 


                                          Degrees          Mean 
                                          of freedom       square 

Between main strata                       $t - 1$            A 
Within main strata: 
    Between sub-strata                    $t (t' - 1)$       D 
    Within sub-strata                     $t t' (k - 1)$     E 
    Total                                 $t (t' k - 1)$     B 

Total for sample                          $t t' k - 1$       C 


If $k$ is small the bias in the estimate $s_1^2$ provided by the within-strata mean 
square may be appreciable. This bias can be almost completely eliminated 
by using the formula 


$$s_1^2 = \{(t' k - 1)\,B + E\}/t' k$$

which is derived directly from formula 8.3. 


8.5 Stratified sample with variable sampling fraction 
In the notation of Section 8.3, we have 

$$N V(\bar{y}) = \sum s_i^2 h_i (1 - f_i)/f_i$$


The $n_i$ or $f_i$ required for a given accuracy can only be determined uniquely 
from this equation if the relations between the different $f_i$ have been decided. 
It has already been pointed out (Section 3.5) that for maximum accuracy the 
$f_i$ should be proportional to $\sigma_i$, but that in many types of material stratified 
by size-groups the $f_i$ may be taken proportional to the mean sizes of the 
size-groups. If we put $f_i = c \lambda_i$, where the $\lambda_i$ are in the required proportions, 
the above equation can be written 


$$\frac{1}{c} \sum (s_i^2 h_i/\lambda_i) = N V(\bar{y}) + \sum s_i^2 h_i$$


The value of $c$ for any required accuracy can then be calculated. If, however, 
a value of $c$ is obtained which makes some of the $f_i$ greater than 1 the calculation 
must be repeated, omitting the terms for these strata from both sides of the 
equation. 

Alternatively the direct expression for $V(\bar{y})$ can be used and the value of $c$ 
found by trial. This has the advantage that the effect of adjustments of the 
final sampling fractions to simple fractions is immediately apparent. 

The relative precision of stratified samples with variable and with uniform 
sampling fractions can be obtained by calculating $V(\bar{y})$ for both samples. 
It should be noted that if the $f_i$ have been taken proportional to the $s_i$ a slight 
over-estimate of the relative precision will be obtained, owing to errors in the 
$s_i$. This point has been discussed by Sukhatme (1935, A), but is not of great 
importance in practice. 

It will be seen that for these calculations we only require sufficiently accurate 
estimates of the variances within strata and the proportions of units of the 
population in the different strata. The procedure is therefore the same whether 
or not the sample from which the data are obtained is stratified. All that is 
required is that all strata should be adequately represented.* 


Example 8.5.a 

From the data of Table 8.3.c determine the size of sample required to give 
a standard error of ± 1500 acres in the estimate of wheat acreage, when 
sampling fractions proportional to those of Table 3.7.a are used. 

The $\lambda_i$ can be taken equal to the sampling fractions of Table 3.7.a. 
Tabulating $s_i^2 h_i$ and $s_i^2 h_i/\lambda_i$, we find 

$\sum (s_i^2 h_i/\lambda_i) = 2918$, $\sum s_i^2 h_i = 329.77$ 

Also $N V(\bar{y}) = V(Y)/N = 1500^2/2496 = 901.44$, and hence 

$c = 2913/\{901.44 + 329.77\} = 2.37$ 

The total number required in the sample is therefore $135 \times 2.37 = 320$, 
the number in the largest size-group, for example, being $51 \times 2.37/3 = 40$. 
No sampling fraction is greater than 1, and therefore no further computation 
is required. 

In practice the new sampling fractions may well be rounded off, taking, 
for example, all of the largest size-group, ½ of the next, etc. 
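
The solution for $c$, including the repetition required when some sampling fraction would exceed 1, can be sketched as follows. This is an illustrative Python routine, not a procedure given in the text: the name `scale_factor` and the two-stratum figures used to exercise it are hypothetical. The sampling fractions of Table 3.7.a are not reproduced in this section, so the last lines simply repeat the arithmetic of Example 8.5.a from the sums quoted there.

```python
def scale_factor(s_sq, h, lam, nv_target):
    """Solve N*V = sum s_i^2 h_i (1 - f_i)/f_i for c, where f_i = c*lam_i.

    Strata whose fraction would exceed 1 are taken in full (f_i = 1), so
    their terms drop out of both sides and the calculation is repeated.
    """
    active = list(range(len(h)))
    while True:
        numerator = sum(s_sq[i] * h[i] / lam[i] for i in active)
        denominator = nv_target + sum(s_sq[i] * h[i] for i in active)
        c = numerator / denominator
        over = [i for i in active if c * lam[i] > 1]
        if not over:
            return c
        active = [i for i in active if i not in over]


def n_times_variance(s_sq, h, f):
    """Direct expression N*V for given sampling fractions f_i."""
    return sum(s * hi * (1 - fi) / fi for s, hi, fi in zip(s_sq, h, f))


# Hypothetical two-stratum illustration (values are not from the text).
s_sq, h, lam = [100.0, 400.0], [0.8, 0.2], [1.0, 2.0]
c_easy = scale_factor(s_sq, h, lam, 500.0)   # no fraction exceeds 1
c_tight = scale_factor(s_sq, h, lam, 30.0)   # second stratum taken in full

# Arithmetic of Example 8.5.a from the sums quoted in the text.
nv = 1500 ** 2 / 2496
c_text = 2913 / (nv + 329.77)
print(round(nv, 2), round(c_text, 2))
```

Substituting the resulting fractions back into the direct expression for $N V(\bar{y})$ is a convenient check, and is also the route by which the effect of rounding the fractions can be examined.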

* The case in which domains cut across strata is discussed in Section 9.5. 
The direct approach illustrates the way in which results of this kind can 
readily be obtained by interpolation. A first approximation to the number 
required is given by $135 \times 2550^2/1500^2 = 390$. The standard error 
corresponding to this sample number, obtained by the ordinary methods, is 
± 1302. The squares of the reciprocals of this and of the original standard 
errors can be plotted against the respective numbers in the samples and a 
smooth curve drawn through these two points and the origin. This curve 
gives the general relation between sample size and accuracy, and will be found 
to give a sample number corresponding to a standard error of ± 1500 of 
approximately 320. 
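
The graphical interpolation can be imitated numerically. The sketch below is in Python and rests on an assumption not made in the text: the "smooth curve" is represented by a quadratic through the origin, $1/\mathrm{SE}^2 = a n + b n^2$, fitted to the two known points. The function name `size_for_se` is illustrative. A curve drawn by eye will differ a little, so the result should be read as approximate.

```python
import math


def size_for_se(n_a, se_a, n_b, se_b, se_target):
    """Sample size for a target standard error by interpolation.

    Fits 1/SE^2 = a*n + b*n^2 through the origin and the two points
    (n_a, 1/se_a^2), (n_b, 1/se_b^2), then solves for n at the target.
    """
    y_a, y_b, y_t = 1 / se_a ** 2, 1 / se_b ** 2, 1 / se_target ** 2
    b = (y_b / n_b - y_a / n_a) / (n_b - n_a)
    a = y_a / n_a - b * n_a
    if b == 0:                       # curve is a straight line
        return y_t / a
    return (-a + math.sqrt(a * a + 4 * b * y_t)) / (2 * b)


first_approx = 135 * 2550 ** 2 / 1500 ** 2          # about 390
n_required = size_for_se(135, 2550, 390, 1302, 1500)
print(round(first_approx), round(n_required))
```

With the figures of the example the fitted curve gives a sample number a little below 320, in reasonable agreement with the value read from the plotted curve.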


Example 8.5.b 


Determine the relative precision of the sample of Table 3.7.a and the 
sample with sampling fractions proportional to $s_i$ containing the same number 
of farms. 


The standard error, using these sampling fractions, can be calculated in 
the ordinary manner, and is found to be ± 2420. The relative precision is 
therefore $2420^2/2550^2 = 0.90$. There is consequently an apparent loss of 
precision of approximately 10 per cent., but the real loss is likely to be less 
than this, owing to errors in the estimates of the standard errors. 

This apparent loss refers to a single variate, acreage of wheat. If, for 
instance, the acreage of some other crop were taken, the $\sigma_i$ would be different 
and the sampling fractions required to give minimum variance would therefore 
also be different. Consequently, if several variates have to be determined, 
a compromise will in any case be required. 


8.6 Supplementary information 


The determination of the number of units required in a sample when 
supplementary information is available presents no essentially new problems. 
It has been shown in Chapter 7 that apart from the substitution of $s_q^2$ or $s_r^2$ 
for $s^2$ the formulae for the variances of estimates based on supplementary 
information differ little from those for estimates from similar samples without 
supplementary information. Consequently it will usually be sufficient to 
estimate the appropriate variance by the methods given in Chapter 7, using 
this variance instead of the ordinary variance per unit to determine the size 
of sample. The factor $\bar{X}^2/\bar{x}^2$ in the variance of the ratio estimate differs from 
unity only because of sampling fluctuations in $\bar{x}$, and can be omitted. 

When the ratio method is to be used and $V_x(r)$ is virtually constant for all $x$, 
it will often be advantageous to estimate this variance rather than $s_q^2$. This 
will generally lead to somewhat simpler and more straightforward computations. 
Any slight bias introduced into the estimates of error will be of little consequence, 
since it will merely result in a slightly larger or smaller sample being taken. 
We frequently require an estimate of the gain in precision due to the use 
of supplementary information. This is needed in planning a sample survey 
when a decision has to be reached whether supplementary observations should 
be taken. It is also required in the planning of the computations in order to 
decide whether the utilization of available supplementary information is worth 
the additional computational labour. 

In the case of the regression method the relative precision is very simply 
calculated, since it depends only on the value of the correlation coefficient $r$, 
being in fact 

$$\frac{1}{1 - r^2}$$

In calculating $r$ due regard must be had to any restrictions imposed by 
stratification, the same sums of squares and products being used as in the 
calculation of the regression coefficient and the residual error. The above 
expression is approximate in that the reduction by 1 of the error degrees of 
freedom with the regression has been ignored, but this correction will be small 
relative to errors in the estimation of $r$. 

If an arbitrary value $b_0$ of the regression coefficient is used the relative 
precision will be 

$$\frac{1}{1 - r^2 + r^2 (1 - b_0/b)^2}$$


The corresponding expression for the ratio method is obtained by writing 
$\bar{r}$ for $b_0$. 


Example 8.6.a 


From the data of Example 7.17 calculate (a) the number of farms required 
to give an unbiased estimate of the mean dressing of nitrogen per acre over 
the farms of the county with a standard error of ± 0.05 cwt., and (b) the 
number of farms required in each of two equal groups so that the comparison 
based on the unweighted means of the dressings per acre of the two groups 
has a standard error of ± 0.05 cwt. 


(a) The required number is $67 \times 0.0392^2/0.05^2 = 41$. The correction for 
finite sampling is trivial in this example. Note that either $s_q^2$ or $s_r^2$ can be 
used to arrive at this result. 

(b) If the required number in each group is $n$, the variance of the difference 
of the means is $2 s^2/n$. Hence $n = 2 \times 0.0541/0.05^2 = 43$. 


Example 8.6.b 
Obtain the expressions for the relative efficiencies given in Example 7.12.b 
from the above formulae. 


We have $b = 0.6327$, $r = 52,069/\sqrt{(115,266 \times 82,296)} = 0.537$ and 
$\bar{r} = 147.36/132.08 = 1.116$. Hence the relative precision of the regression 
method, compared with the sample plots only, is $1/(1 - 0.537^2) = 1.40$. 
Similarly the use of differences ($b_0 = 1$) gives a relative precision of 
$1/\{1 - 0.537^2 + 0.537^2 (1 - 1/0.6327)^2\} = 1.23$, the ratio method ($b_0 = 1.116$) 
gives a value of 1.14, and the regression ($b_0 = 0.55$) gives a value of 1.39. These 
correspond to the relative efficiencies already tabulated except in the case 
of the regression, for which we have here neglected the correction for degrees 
of freedom. 


Example 8.6.c 


Determine the gains in precision in the estimation of wheat acreages from 
the random sample of Hertfordshire farms due to the use of supplementary 
information on acreages of crops and grass, (a) using the ratio method, and 
(b) using the regression method, without taking account of districts. 


The standard errors, already obtained, are ± 7950 for direct estimation 
without the use of supplementary information (Example 7.2.b), ± 3940 for 
the ratio method (Example 7.8.a), and ± 4126 for the regression method 
(Example 7.12.a). The apparent gain in precision due to the ratio method 
is therefore $7950^2/3940^2 = 4.07$, and that due to the regression method is 
$7950^2/4126^2 = 3.71$. 

The value for the regression appears anomalous, since the formulae given 
above indicate that regression may be expected to be at least as efficient (apart 
from the change in degrees of freedom) as the ratio method. The discrepancy 
is due to the inclusion of the factor $\bar{X}^2/\bar{x}^2$ in the variance of the ratio estimate. 
Using the above formulae with $r = 0.8555$, $b = 0.1932$, $\bar{r} = 0.1522$, we find 
that the relative precision, compared with direct estimation, is 3.73 for the 
regression method, and 3.32 for the ratio method. An alternative estimate of 
the relative precision of the regression and the ratio methods is therefore 
$3.73/3.32 = 1.12$. This latter value gives a better indication of the average 
value of the relative precision of the two methods. 
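
The two expressions of this section for the gain due to supplementary information are easily evaluated by machine. The following Python sketch (the function name `relative_precision` is an illustrative choice) applies them to the values of $r$, $b$ and the ratio of means quoted in Example 8.6.c. As in the text, the correction for the loss of one degree of freedom is ignored.

```python
def relative_precision(r, b=None, b0=None):
    """Relative precision of estimation with supplementary information.

    With b0 omitted the fitted regression is assumed: 1/(1 - r^2).
    With an arbitrary coefficient b0 (b0 = 1 for differences, b0 equal to
    the ratio of means for the ratio method) the expression becomes
    1/{1 - r^2 + r^2 (1 - b0/b)^2}.
    """
    r_sq = r * r
    if b0 is None:
        return 1 / (1 - r_sq)
    return 1 / (1 - r_sq + r_sq * (1 - b0 / b) ** 2)


regression = relative_precision(0.8555)
ratio = relative_precision(0.8555, b=0.1932, b0=0.1522)
print(round(regression, 2), round(ratio, 2), round(regression / ratio, 2))
```

Since the second expression can never exceed the first, the fitted regression is always at least as precise as any arbitrary coefficient, which is the point made in the discussion of the apparent anomaly.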


8.7 Two-phase sampling 


The only case which presents any new features is that in which the first- 
phase information is used as supplementary information to improve the 
accuracy of estimates of the second-phase variate y. It has already been pointed 
out in Section 7.8 that the variance of a two-phase sample is in this case made 
up of two parts A and B, where 


A = variance due to the first-phase sampling, i.e. the variance which would 
be obtained if y were determined for all the units of the first-phase 
sample, 


B = variance due to the second-phase sampling of the first-phase sample 
(regarded as without error). 
To determine B the methods given in Chapter 7 for supplementary 
information are followed, the effective sampling fraction being $n_2/n_1$. To 
determine A we must use the methods given in the present chapter for the 
evaluation of the error of a sample of one size and type from the data of a 
sample of a different size and possibly different type. Thus, if the first-phase 
sampling is random, and the second-phase sampling is stratified with a variable 
sampling fraction, it is necessary to calculate the variance of an unstratified 
random sample of $n_1$ units from the data of a stratified sample with variable 
sampling fraction of $n_2$ units. 

Once A and B have been determined the calculation of the relative precision 
of different possible sampling methods presents no difficulty. If, for example, 
we wish to ascertain the increase in precision due to taking a two-phase sample 
of $n_1$ and $n_2$ units instead of a single-phase sample of $n_2$ units, we calculate 
what the variance A' of a sample of $n_2$ units would be if the first-phase sampling 
procedure were followed for a sample of $n_2$ units. This calculation will follow 
the same lines as that of A. The relative precision is then A'/(A + B). 
Similarly the relative precision resulting from the ascertainment of the second- 
phase information on the $n_2$ second-phase units only, instead of on all the $n_1$ 
units of the sample, will be A/(A + B). 

In the simple but general case in which the population is large, and the 
methods of sampling and estimation are such that the variances of the estimates 
at each phase are inversely proportional to the numbers of units, apart from 
the factor $1 - n_2/n_1$, the above relative precisions are capable of simple 
expression. If the effective variances per unit are $s_1^2$ and $s_2^2$, with $s_2^2/s_1^2 = \kappa^2$ 
and $n_2/n_1 = \lambda$, we have 

$$n_2 A = \lambda s_1^2, \qquad n_2 A' = s_1^2, \qquad n_2 B = (1 - \lambda)\,\kappa^2 s_1^2$$

Consequently the relative precision giving the gain due to the inclusion of the 
additional first-phase units is 

$$\frac{A'}{A + B} = \frac{1}{(1 - \lambda)\kappa^2 + \lambda}$$

Similarly the loss by not ascertaining the second-phase information over all 
the first-phase units is given by the relative precision 

$$\frac{A}{A + B} = \frac{\lambda}{(1 - \lambda)\kappa^2 + \lambda}$$

Representative values of these fractions are given in Table 8.7. If, for 
example, the effective standard error per unit is halved by the use of the first- 
phase supplementary information, $\kappa^2 = 1/4$. Consequently, if we introduce 
two-phase sampling and quadruple the size of the sample for first-phase 
information only, instead of using single-phase sampling, the amount of 
information derived from a second-phase unit is increased by a factor of 2.29. 
Similarly by collecting second-phase information on only 1/4 of the first-phase 
units instead of all the units the amount of information is reduced by a factor 
of 0.57. 
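
The two fractions can be evaluated for any pair of values of $\kappa^2$ and $\lambda$. The Python sketch below uses illustrative function names and reproduces the worked figures for $\kappa^2 = 1/4$ and $\lambda = 1/4$ quoted in the preceding paragraph.

```python
def gain_from_first_phase(kappa_sq, lam):
    """A'/(A + B): two-phase sample of n1 and n2 units against a
    single-phase sample of n2 units, with lam = n2/n1."""
    return 1 / ((1 - lam) * kappa_sq + lam)


def loss_from_subsampling(kappa_sq, lam):
    """A/(A + B): two-phase sample of n1 and n2 units against a
    single-phase sample of all n1 units."""
    return lam / ((1 - lam) * kappa_sq + lam)


gain = gain_from_first_phase(0.25, 0.25)
loss = loss_from_subsampling(0.25, 0.25)
print(round(gain, 2), round(loss, 2))
```

When $\kappa^2 = 1$ the supplementary information contributes nothing: the first fraction is then 1 for every $\lambda$, and the second reduces to $\lambda$.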


TABLE 8.7—RELATIVE PRECISION OF TWO-PHASE AND SINGLE-PHASE SAMPLING 


Two-phase sample:       $n_1$ and $n_2$        $n_1$ and $n_2$ 
Single-phase sample:    $n_2$                  $n_1$ 

[Table body, classified by $\kappa^2$ and $\lambda$, not recoverable.] 

8.8 Sampling on successive occasions 


The relative efficiency of the various estimates can be calculated from the 
variances given in Section 7.19. When the variances on the different occasions 
are the same, the relative efficiency of the various estimates, under the conditions 
set out in Section 6.22, depends only on $\mu$ and the correlation $r$ between the 
successive occasions. 

Table 8.8.a gives the efficiencies, relative to those of the overall mean, 
of the adjusted estimates of the mean on the last occasion (a) when there is 
a sub-sample on the second occasion, and (b) with partial replacement, the 
latter being given for both two and a large number of occasions. Values for 
$\mu = 1/2$ and $\mu = 1/3$, and for various values of $r$, are given. With independent 
samples or a fixed sample the overall means are fully efficient. 


TABLE 8.8.a—SAMPLING ON SUCCESSIVE OCCASIONS : EFFICIENCY, RELATIVE TO 
THE OVERALL MEAN, OF THE ADJUSTED ESTIMATES OF THE MEAN ON THE LAST 
OCCASION 


                  $\mu = 1/2$                            $\mu = 1/3$ 
         Sub-      Partial replacement       Sub-      Partial replacement 
  r      sample    Two          Large        sample    Two          Large 
                   occasions    number                 occasions    number 

  0      1.00      1.00         1.00         1.00      1.00         1.00 
 .25     1.03      1.02         1.02         1.02      1.02         1.03 
 .5      1.14      1.07         1.08         1.09      1.06         1.07 
 .6      1.22      1.11         1.13         1.14      1.09         1.11 
 .7      1.32      1.16         1.20         1.20      1.13         1.18 
 .8      1.47      1.24         1.33         1.27      1.18         1.30 
 .9      1.68      1.34         1.65         1.37      1.25         1.59 
 .95     1.82      1.41         2.10         1.43      1.29         2.02 
1.0      2.00      1.50         Inf.         1.50      1.33         Inf. 
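
Entries of this kind can be generated from the usual variance recursion for sampling with partial replacement. The Python sketch below is offered under stated assumptions rather than as the derivation of Section 7.19, which is not reproduced here: $\mu$ is taken to be the fraction of units replaced (or, for the sub-sample, omitted) on each occasion, the variance is taken to be the same on all occasions, and the function names are illustrative. On these assumptions the efficiency with a sub-sample is $1/(1 - \mu r^2)$, and with partial replacement the efficiency $E_h$ on occasion $h$ follows from $E_h = \mu + 1/\{(1 - r^2)/(1 - \mu) + r^2/E_{h-1}\}$ with $E_1 = 1$.

```python
def subsample_efficiency(mu, r):
    """Efficiency of the regression-adjusted mean when a fraction
    1 - mu of the first sample is retained on the second occasion."""
    return 1 / (1 - mu * r * r)


def partial_replacement_efficiency(mu, r, occasions):
    """Efficiency, relative to the overall mean, of the adjusted estimate
    on the last of `occasions` occasions with a fraction mu replaced."""
    retained = 1 - mu
    efficiency = 1.0                      # first occasion: plain mean
    for _ in range(occasions - 1):
        efficiency = mu + 1 / ((1 - r * r) / retained
                               + r * r / efficiency)
    return efficiency


for mu in (1 / 2, 1 / 3):
    print(round(subsample_efficiency(mu, 0.8), 2),
          round(partial_replacement_efficiency(mu, 0.8, 2), 2),
          round(partial_replacement_efficiency(mu, 0.8, 200), 2))
```

For $r = 0.8$ the routine gives values agreeing with the corresponding row of Table 8.8.a; a large number of iterations stands in for the limiting case, and the case $r = 1$ (where the limit is infinite) should not be attempted in this way.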




TABLE 8.8.b—SAMPLING ON SUCCESSIVE OCCASIONS : EFFICIENCY, RELATIVE TO 
THE DIFFERENCE OF THE OVERALL MEANS, OR TO INDEPENDENT SAMPLES 
(VALUES IN BRACKETS), OF ALTERNATIVE ESTIMATES OF CHANGE 


                   $\mu = 1/2$                              $\mu = 1/3$ 
  r      $Y_n - Y_{n-1}$    Weighted           $Y_n - Y_{n-1}$    Weighted          Fixed 
                         estimate                           estimate          sample 

  0      1.00 (1.00)     1.00 (1.00)        1.00 (1.00)     1.00 (1.00)       (1.00) 
 .25     1.02 (1.16)     1.02 (1.17)        1.02 (1.22)     1.02 (1.22)       (1.33) 
 .5      1.10 (1.47)     1.12 (1.50)        1.09 (1.63)     1.11 (1.67)       (2.00) 
 .6      1.18 (1.69)     1.22 (1.75)        1.15 (1.92)     1.20 (2.00)       (2.50) 
 .7      1.32 (2.03)     1.41 (2.17)        1.27 (2.37)     1.36 (2.56)       (3.33) 
 .8      1.60 (2.67)     1.80 (3.00)        1.50 (3.22)     1.71 (3.67)       (5.00) 
 .9      2.43 (4.41)     3.02 (5.50)        2.21 (5.53)     2.80 (7.00)      (10.00) 
 .95     3.99 (7.61)     5.51 (10.50)       3.58 (9.76)     5.01 (13.67)     (20.00) 


The increase in precision due to the use of partial replacement instead 
of independent samples or a fixed sample can also be obtained from Table 8.8.a. 
Thus with a correlation of 0.8 replacement of half the units gives a 24 per cent. 
increase in precision on the second occasion and a 33 per cent. increase after 
a number of occasions. With one-third replacement the corresponding 
percentages are 18 and 30. 

Table 8.8.b gives similar efficiencies, relative to the differences of the 
overall means, or to independent samples (values in brackets), of the estimates 
of change given by $Y_n - Y_{n-1}$ and by the weighted estimate based on the last 
two occasions only (formula 6.21.b). 

In the estimation of change the difference between the overall means of 
two independent samples is less accurate than the difference of the overall 
means of a sample with partial replacement. This in its turn is less accurate 
than the difference between the means of a fixed sample. Thus with a 
correlation of 0.8 the weighted estimate from the last two occasions, with 
replacement of half the units, is 3.00 times as efficient as the difference of the 
means of two independent samples, but only 1.80 times as efficient as the 
difference of the overall means of the replacement sample. A repeated sample 
under these circumstances is 5.00 times as precise as a pair of independent 
samples. 

It will be noted that the estimate of change derived from the last two 
occasions is always somewhat more accurate than the estimate $Y_n - Y_{n-1}$. 
With a correlation of 0.8, for instance, there is a gain in efficiency of 12 per cent. 
when $\mu = 1/2$ and of 14 per cent. when $\mu = 1/3$. 
Example 8.8 


Estimate the relative efficiency of the various estimates of Examples 6.21 
and 6.22. 


In Example 6.21, $r = 0.847$ and $\mu = 1/3$. Consequently, from Table 8.8.a, 
the relative efficiency of $\bar{y}_w$ and $\bar{y}$ is 1.21. From Table 8.8.b the relative 
efficiency of the weighted estimate of change and the difference of the overall 
means is about 2.1. The relative efficiency of the difference of the means of 
the units common to both occasions and the weighted estimate is given by the 
weight of the former, namely, 0.929. 

The relative efficiency of the estimates of Example 6.22 cannot easily be 
determined exactly, owing to the variation in the numbers of units from occasion 
to occasion. With the average value of $\mu$ of 1/3 and a correlation of 0.811, the 
efficiency of $Y_n$ relative to the overall mean after a number of occasions will 
be 1.32 (Table 8.8.a) and that of the estimate $Y_n - Y_{n-1}$ of change relative 
to the difference of the overall means will be about 1.6 (Table 8.8.b). 


8.9 Sampling with probability proportional to size of unit 


The relative precision of sampling with uniform probability and with 
probability proportional to size of unit depends on the variance laws to which 
the material is subject. The case in which the mean $r$ for fixed $x$ is the same 
for all values of $x$, and in which the variance of $r$ for fixed $x$ is a function of $x$, 
may first be considered. 

If the total size of all units is known, we shall be concerned with estimates 
of $\bar{r}$. If we put $V(x)/\bar{x}^2 = \gamma$ we have the results shown in Table 8.9.a for 
the three variance laws there given, $v$ being a constant. 


TABLE 8.9.a—VARIANCES OF $\bar{r}$ 


Variance of $r$                  Variance of $\bar{r}$ 
for fixed $x$         Uniform                  Probability 
                    probability              proportional to $x$ 

$v$                   $v (1 + \gamma)/n$           $v/n$ 
$v/x$                 $v/n\bar{x}$                 $v/n\bar{x}$ 
$v/x^2$               $v/n\bar{x}^2$               $v (1 + \gamma)/n\bar{x}^2$ 


In sampling for yield per acre in a crop estimation scheme, for example, 
the variance of the yield per acre may be expected to be about the same for 
large and small fields. If in addition there is no marked difference between 
the mean yields per acre of small and large fields, the precision of sampling 
with probability proportional to size relative to sampling with uniform 
probability will be $1 + \gamma$. 
If the mean $r$ for fixed $x$ varies with $x$ the variances of Table 8.9.a will 
be increased, and the precision of either method, or the relative precision of 
the two methods, may best be judged by direct analysis of actual data. 

If the acreage of the crop has to be determined by the sampling of fields, 
the relative precision of sampling with probability proportional to size, and 
with uniform probability, will also depend on the variance of the acreages. 
The simplest case is that in which the sampling is used to determine which of 
the fields carry the given crop, and in which the values of $\bar{x}$ and $V(x)$ are the 
same for the fields of the given crop and for the remaining fields, the number 
of fields being large. The variance of the proportion $p$ of the total area under 
the given crop when $n'$ fields are taken is in this case $pq/n'$ with sampling with 
probability proportional to size, and $pq (1 + \gamma)/n'$ with uniform probability. 
The relative precision is therefore $1 + \gamma$. 

In the case of sampling with probability proportional to size, point 
sampling will often be used. If the part of the land area which consists of 
fields cannot be recognized on the map, additional points will have to be visited 
on the ground, and these must be allowed for in assessing the total number of 
points required. 

In the more complicated cases of sampling with probability proportional to 
size the same general approach as that adopted in the previous sections must 
be followed, using the data provided by an actual sample to determine the 
relevant variances. If the basic data are derived from a sample taken with 
probability proportional to size, $s_r^2$ can be calculated from the formulae of 
Sections 7.15 or 7.16. The value so obtained may then be used to deduce 
the size of sample required for a given accuracy. 

If the basic data are derived from a sample taken with uniform probability 
of selection, or if data relating to the whole population are available, the various 
sizes of unit will occur in proportions which are different from those of a sample 
taken with probability proportional to size of unit. Consequently a different 
formula is required for the calculation of $s_r^2$. The appropriate formula for a 
random sample is 


where $\bar{r}_u = S(y)/S(x)$. If the individual values of $y$ and $r$ are tabulated the 
second form of the expression is most convenient for computation. 

In the case of a stratified sample the expression within the square brackets 
must be evaluated for each stratum separately. If the number in each stratum 
is small and there is no great difference between the $\bar{x}_i$, the separate components 
can then be aggregated and divided by $(n - t)$. If there are considerable 
differences between the $\bar{x}_i$ it is best to calculate $s_{ri}^2$ separately for each stratum, 
using the separate values in the calculation of $V(Y)$.* 


* Sampling with probability proportional to size and stratifying by size with 
variable sampling fraction are discussed in Section 10.10. 
Example 8.9.a 


From the data of Table 6.19.a construct a frequency distribution of the 
acreages of sugar-beet fields on old arable land in Norfolk, and hence calculate 
the relative precision of estimates of the mean yield per acre derived from a 
random sample of fields taken (a) with probability proportional to size and 
(b) with uniform probability, on the assumption that the variability of the 
yield per acre is the same for all sizes of field. 


In constructing the frequency distribution, account must be taken of the 
variable sampling fractions at the two stages of sampling. Since the raising 
factors at the first stage are nearly proportional to 7, 4, 2, the fields on the small, 
medium and large farms with a single field of sugar-beet must be counted 7, 
4 and 2 times respectively. Similarly a field occurring on a farm with 2 fields 
of sugar beet must be counted 14, 8 or 4 times, etc. 

This procedure gives the frequency distribution shown in Table 8.9.b. 


TABLE 8.9.b—FREQUENCY DISTRIBUTION OF THE ACREAGES OF 
SUGAR-BEET FIELDS 


             Raised No.                Raised No. 
Acreage      of fields      Acreage    of fields 

    2            92            12           4 
    3            43            13           8 
    4           117            — 
    5            48            20           8 
    6            54            — 
    7            63            24           6 
    8            42            — 
    9             4            29          12 
   10            60            — 
   11            24            48           2 
                                          587 


Following the method of Example 7.1.a for grouped data (the acreages 
being taken as the working units), we find 
x̄ = 6.681, s² = V (x) = 30.47, V (x)/x̄² = 30.47/6.681² = 0.681 


Consequently the relative precision of methods (a) and (b) is 1.68. 
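The grouped-data arithmetic of this example can be checked with a short Python sketch. It assumes the raised frequencies of Table 8.9.b as read above, a divisor of one less than the raised number of fields, and that the relative precision is one plus the ratio of the variance of the acreages to the square of their mean, which is the relation implied by the figures quoted. The function and variable names are illustrative only.

```python
# Raised frequency distribution of field acreages, as read from Table 8.9.b.
acreage_counts = {
    2: 92, 3: 43, 4: 117, 5: 48, 6: 54, 7: 63, 8: 42, 9: 4, 10: 60, 11: 24,
    12: 4, 13: 8, 20: 8, 24: 6, 29: 12, 48: 2,
}


def grouped_mean_and_variance(counts):
    """Mean and variance (divisor N - 1) of a grouped frequency distribution."""
    total_n = sum(counts.values())
    total_x = sum(x * f for x, f in counts.items())
    total_x2 = sum(x * x * f for x, f in counts.items())
    mean = total_x / total_n
    variance = (total_x2 - total_x ** 2 / total_n) / (total_n - 1)
    return total_n, mean, variance


n_fields, mean_acreage, var_acreage = grouped_mean_and_variance(acreage_counts)

# Assumed relation: relative precision = 1 + V(x) / mean(x)^2.
relative_variance = var_acreage / mean_acreage ** 2
relative_precision = 1 + relative_variance

print(n_fields, round(mean_acreage, 3), round(var_acreage, 2))
print(round(relative_precision, 2))
```

The unrounded ratio comes out at about 0.68, so the relative precision of 1.68 is reproduced whichever third decimal is carried.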


Example 8.9.b 


From the data of the sample of Hertfordshire parishes taken with uniform 
probability (Sample A of Section 3.11) estimate the value of s_r² for a sample 
of parishes, stratified by districts, taken with probability proportional to size. 
Make a similar estimate from the data for all 91 combined parishes. 


The data for the 91 combined parishes are shown in Table 8.9.c, the 
parishes selected for samples A and B being indicated in the table. 


TABLE 8.9.c—ACREAGES OF CROPS AND GRASS (DIVIDED BY 10), AND OF WHEAT, 
IN THE 91 COMBINED HERTFORDSHIRE PARISHES 


Dist. C. & G. Wh. Dist. C. & G. Wh. Dist. C. & G. Wh. Dist. C. & G. Wh. 
1 249 316a 3 264 386a 4 363 958a 5 380 491 


335 208 366 220 454 347 818b 
664 237 390 907 363 741 
226 227 251 466 405 582 
256 220 210 426 337 
314 $ 436 217 263 371 
248 26 214 305 779 294 
333 230 558b 
2 283 612 464 227 440a 6 252 
247 624 232 307 
205 356a 210 374 
304 766b 201 
220 362 265 
344 701b 634 
237 567 228 
204 503b 
209 573 
336 728a 276 651 
305 901 281 503 
330 515 273 604 
aa, Raoa 7 306 290b 
226 434 380 244 
220 506 ae 2a 
5 318a 
aad 251 116 


27,304 44,676 


The parishes selected for samples A and B of Table 3.11.b are indicated by 
the letters a and b respectively. 




The values of x, y, and r for district 4 (sample A) are as follows: 


     x       y        r 
   363     958   2.6391 
   227     440   1.9383 
   250     518   2.0720 
   289     495   1.7128 
   242     565   2.3347 

  1371    2976   2.1707 


Thus we have S (ry) − r̄_w S (y) = 958 × 2.6391 + ... − 2976 × 2.1707 
= 161.3. The corresponding values for districts 2, 3 and 6 are 63.9, 116.8, 
and 0.0, with a sum of 342.0. The sample mean of x for these four districts 
is 266.4 and consequently s_r² = 342.0/(10 × 266.4) = 0.1284, or in acreage 
units 0.001284. 

This value is considerably less than the value 0.002649 obtained in Example 
7.16. Each estimate, however, is based on only 10 degrees of freedom, so 
that the discrepancy is not exceptionally large. The corresponding value from 
the data for all 91 combined parishes, calculated in the same manner, is 
0.002222. This calculation is left as an exercise for the reader. 
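The district 4 step of this calculation can be sketched in Python as follows. The bracketed expression is taken to be S (ry) − r̄_w S (y) with r̄_w = S (y)/S (x), as in the worked figures; the contributions of districts 2, 3 and 6 and the mean of x are simply the values quoted in the text, not recomputed. Names are illustrative.

```python
# District 4, sample A: x = crops and grass (acreage / 10), y = wheat acreage.
x = [363, 227, 250, 289, 242]
y = [958, 440, 518, 495, 565]


def ratio_sum_of_squares(xs, ys):
    """S(r*y) - r_w * S(y), where r = y/x and r_w = S(y)/S(x)."""
    ratios = [yi / xi for xi, yi in zip(xs, ys)]
    r_w = sum(ys) / sum(xs)
    return sum(ri * yi for ri, yi in zip(ratios, ys)) - r_w * sum(ys)


district_4 = ratio_sum_of_squares(x, y)

# Contributions quoted in the text for districts 2, 3 and 6.
other_districts = [63.9, 116.8, 0.0]
degrees_of_freedom = 10
mean_x = 266.4  # sample mean of x over the four districts, as quoted

s_r2 = (district_4 + sum(other_districts)) / (degrees_of_freedom * mean_x)

print(round(district_4, 1), round(s_r2, 4))
```

Working with unrounded ratios gives a district 4 term about 0.1 larger than the 161.3 obtained from the four-decimal ratios, which does not affect s_r² to the accuracy quoted.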


Example 8.9.c 


Compare the relative precision, in the estimation of wheat acreage, of 
samples of Hertfordshire parishes taken with uniform probability and with 
probability proportional to size, by calculating the expected standard errors of 
samples of types A and B of Table 3.11.b. 


The data for all 91 combined parishes give a value of s_q² of 23,483 when 
districts are eliminated and the same ratio is taken for all districts, and a value 
of 22,427 when different ratios are taken for the different districts. 

In calculating the expected standard error the formula of Section 7.10 
may be used, so as to allow for the variation in sampling fraction from district 
to district. The factors X_i²/{S_i (x)}² may be replaced by 1/f_i² since we are 
considering the average error to be expected over a series of similar samples. 
This will lead to a slight underestimation of the average error. 

We find Σ (1 − f_i) n_i/f_i² = 402.83, and consequently V (Y) = 9.460 × 10⁶ 
when the same ratio is taken for all districts, and 9.034 × 10⁶ when different 
ratios are taken. 

Similarly, in the case of sampling with probability proportional to size, 
from the results already given in Example 7.16, and the value of s_r² given in 
Example 8.9.b, we find V (Y) = 8.263 × 10⁶. 

The standard errors corresponding to these variances have already been 
given in Table 3.11.b. 

The relative precision of sampling with probability proportional to size, 
and with uniform probability using a single value of the ratio, is therefore 
9.460/8.263 = 1.14. There is thus a gain in precision of 14 per cent., but 
it must be recognized that sampling with probability proportional to size will 
result in parishes of larger average size being included in the sample. Neglecting 
the disturbance due to the probability being only approximately proportional 
to size, the average size of parish in this case will be given by S (x²)/S (x), where 
the summations are taken over the whole population (or a sample selected with 
uniform probability). This gives an average size of 3244 acres of crops and 
grass, compared with the arithmetic mean of 3000 acres, i.e. an average size 
greater by 8 per cent. 
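The figures of this example follow from a few multiplications, sketched below in Python. The sketch assumes that V (Y) for uniform probability is the product of the quoted sum 402.83 and the appropriate s_q², which is what the quoted results imply; the value for sampling with probability proportional to size is taken as given. Names are illustrative.

```python
SUM_FACTOR = 402.83          # quoted sum of (1 - f_i) n_i / f_i^2 over districts
S_Q2_SINGLE_RATIO = 23483    # same ratio for all districts
S_Q2_SEPARATE_RATIOS = 22427 # different ratio for each district
V_PPS = 8.263e6              # probability proportional to size, as quoted

v_uniform_single = SUM_FACTOR * S_Q2_SINGLE_RATIO
v_uniform_separate = SUM_FACTOR * S_Q2_SEPARATE_RATIOS
relative_precision = v_uniform_single / V_PPS

# Average parish size under selection proportional to size, against the plain mean.
size_increase = 3244 / 3000 - 1

print(round(v_uniform_single / 1e6, 3), round(v_uniform_separate / 1e6, 3))
print(round(relative_precision, 2), round(100 * size_increase))
```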


8.10 Interpretation of the analysis of variance 


The analysis of variance can be interpreted in the manner set out below. 
This interpretation is of particular use when we are concerned with multi-stage 
sampling, and with the effect of change of size of the sampling units. 

If the units fall into groups of any kind, such as strata, the unit values of 
a variate y can be regarded as made up of the sum of two parts, one, u, which 
varies from group to group but has a fixed value for all units of a particular 
group, and the other, v, which varies from unit to unit independently of the 
groups. The variances of u and v may be denoted by U and V respectively. 
Thus u and v may be random sample values from normal distributions, though 
the condition of normality is not necessary. In this hypothetical framework 
zero mean can be assigned to the parent distribution of v without loss of 
generality, but even so the mean of the v’s for all the units of a finite population, 
or for all the units of a particular group, will not be exactly zero, and consequently 
the group means are not exactly equal to the u’s. For this reason the values of 
u and v cannot be uniquely determined from the values of y. 

The mean squares of the analysis of variance provide estimates of U and V. 
If A and B are the mean squares between and within groups, C is the overall 
mean square, k is the number of units in each group, and h the number of 
groups, we have 

A = kU + V 
B = V 

Hence U = (A − B)/k 


We also have, from the analysis of variance, (hk − 1) C = h (k − 1) B 
+ (h − 1) A. Consequently if σ² is the overall variance and σ_w² the variance 
within groups we have, from formula 8.3, 

σ² = U (h − 1)/h + V 
σ_w² = V 

The factor (h − 1)/h is analogous to the correction for sampling from a finite 
population. 

The relative precision of stratified and random sampling will be obtained 
by taking the groups as strata. We then have, with t strata, 


An alternative formulation is possible in terms of the intra-class correlation, 
i.e. the correlation between members of the same stratum when the strata 
themselves are regarded as a random sample from an infinite set of similar 
strata (R. A. Fisher, Statistical Methods for Research Workers, Section 40). 
The estimate r_i of this correlation is given by 

r_i = (A − B)/{A + (k − 1) B} = U/(U + V) 


and consequently 


Looked at from this point of view, the intra-class correlation coefficient 
may be regarded as a quantitative expression of association which is alternative 
to the ratio U/V. In this book we shall use the concept of additive components 
of variance, since this appears to be more easily capable of generalization, and 
is otherwise preferable to the concept of intra-class correlation. 
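The link between the two formulations can be illustrated by a small Python sketch. It assumes equal group sizes k and the relations A = kU + V and B = V; the mean squares used are hypothetical values chosen only to make the arithmetic easy to follow.

```python
def variance_components(ms_between, ms_within, k):
    """Estimate U (between groups) and V (within groups) from mean squares."""
    v = ms_within
    u = (ms_between - ms_within) / k
    return u, v


def intraclass_correlation(ms_between, ms_within, k):
    """Estimate of the intra-class correlation from the same mean squares."""
    return (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)


# Hypothetical mean squares: A = 10 between groups, B = 2 within, k = 4 units.
u, v = variance_components(10.0, 2.0, 4)
r_i = intraclass_correlation(10.0, 2.0, 4)

print(u, v, r_i, u / (u + v))
```

Both routes give the same figure, since (A − B)/{A + (k − 1) B} reduces to U/(U + V) when A and B are replaced by kU + V and V.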

When there is compensation between the different units of the same stratum 
the definition of U as a variance breaks down, and has to be extended (see 
Yates and Zacopanay, 1935, H). Complete compensation occurs when all 
the strata means (or the first-stage units in a two-stage scheme) are equal. 
In this case K U + V = 0, i.e. U = — V/K, where K is the number of units 
in each group of the population. Negative values of U between 0 and — V/K 
are therefore admissible. 


8.11 Multi-stage sampling 


The sampling variance of two-stage sampling can be divided into two parts, 
A and B, where 
A = variance due to the first-stage sampling when there is complete 
ascertainment at the second stage, i.e. when all the second-stage units 
which go to make up the selected first-stage units are known, 


B = variance due to the second-stage sampling of the selected first-stage 
units. 


Thus the formula of Section 7.17 for V (Y) in two-stage random sampling 
may be rewritten 
oF 1—7" 
Y= 24 nz 
(62) n So° i Ron s 


(8.11.a) 


where 


r 8 (8.11.b) 


The first term constitutes part A and the second part B. 

The second term will be recognized as (1 − f”)/(1 − f) times the variance 
that would be obtained with single-stage sampling of the second-stage units, 
the same total number of second-stage units being taken, with the first-stage 
units as strata and uniform sampling fraction f. If f” is small, therefore, the 
first term gives the increase in variance due to the adoption of the two-stage 
process. 

The above subdivision is alternative to that given in Section 7.17. Part A 
is dependent only on the first-stage sampling, being unaffected by the intensity 
or type of sampling at the second stage. This fact considerably simplifies the 
problem of determining the sampling errors for different intensities of sampling 
at the two stages: with the subdivision of Section 7.17 the variation in s? 
for different intensities of sampling at the second stage has to be taken into 
account. 

The only new point that arises in the estimation of the relevant variances is 
the determination of part A from the data of a two-stage sample. In general 

this simply requires that the variance per first-stage unit due to the second-stage 
sampling of the first-stage units be deducted from the variance per first-stage 
unit calculated from the sample. Thus for a two-stage random sample formula 
8.11.b is used. 

It is often helpful to carry out an analysis of variance on data derived from 
a two-stage sampling process. The situation is simplest when the number 
of sampled second-stage units n” in each first-stage unit is the same. Each 
stage of the analysis then follows the same pattern as the analysis of a single- 
stage sample of the same type. At the first stage, however, the values entering 
into the analysis must be either the means or the totals of the second-stage 
unit values. It is customary (though not essential) to tabulate the sums of 
squares of the first stage in terms of the second-stage units. If the first-stage 
unit means are used, therefore, all sums of squares at the first stage must be 
multiplied by n”, while with totals all sums of squares must be divided by n”. 

In the case of two-stage random sampling, for example, the degrees of 
freedom and mean squares will be 


                                        Degrees 
                                        of freedom          Mean square 
Between first-stage units               n’ − 1              n” s’² = V + n” U 
Within first-stage units between 
  second-stage units                    n’ (n” − 1)         s”² = V 
TOTAL                                   n’ n” − 1 = n − 1 

We then have 

V (ȳ) = (1 − f’) U/n’ + (1 − f) V/n          (8.11.c) 


where n is the total number of second-stage units and f is the overall sampling 
fraction (n = n’ n” and f = f’ f”). The second term of this subdivision is 
the estimate of the variance that would be obtained with single-stage sampling 
of the second-stage units, the same total number of sampling units being taken, 
with first-stage units as strata, and uniform sampling fraction. The analysis 
of variance therefore provides a further alternative subdivision of the sampling 
variance. 
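A minimal Python sketch of this subdivision is given below. It assumes equal numbers n” of second-stage units per first-stage unit and takes the mean squares to estimate V + n” U and V. As a check it also evaluates the usual textbook form (1 − f’) s’²/n’ + f’ (1 − f”) s”²/n, with s’² the variance of the first-stage unit means; that this is the form of Section 7.17 is an assumption here. The numerical inputs are hypothetical.

```python
def anova_subdivision(ms_between, ms_within, n1, n2, f1, f2):
    """Two parts of the variance of the mean per second-stage unit.

    ms_between: mean square between first-stage units (estimates V + n2 * U)
    ms_within:  mean square within first-stage units (estimates V)
    n1, n2:     first-stage units sampled, second-stage units in each
    f1, f2:     sampling fractions at the first and second stages
    """
    v = ms_within
    u = (ms_between - ms_within) / n2
    n = n1 * n2
    f = f1 * f2
    return (1 - f1) * u / n1, (1 - f) * v / n


def direct_two_stage_variance(ms_between, ms_within, n1, n2, f1, f2):
    """Assumed standard form, using the variance of first-stage unit means."""
    s1_squared = ms_between / n2
    return (1 - f1) * s1_squared / n1 + f1 * (1 - f2) * ms_within / (n1 * n2)


# Hypothetical sample: 10 first-stage units, 5 second-stage units in each.
part_u, part_v = anova_subdivision(30.0, 10.0, 10, 5, 0.1, 0.5)
direct = direct_two_stage_variance(30.0, 10.0, 10, 5, 0.1, 0.5)

print(round(part_u, 4), round(part_v, 4), round(direct, 4))
```

The two parts add up to the direct value, which is the sense in which the analysis of variance gives an alternative subdivision of the same total.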

The results are similar with stratification with uniform sampling fraction 
at either or both stages. 

When one or both the sampling fractions are variable, or when the numbers 
of second-stage units in the different first-stage units are unequal, the analysis 
of variance becomes more complicated and the direct approach is often simplest. 
With moderate inequality in the n” the analysis of variance of the first-stage 
units may be carried out on the means, with multiplication of the mean squares 
by n̄”, or better by the harmonic mean of the n”, i.e. the reciprocal of the mean 
of the reciprocals. 

Alternatively the whole analysis may be carried out in terms of the 
second-stage units. In this case both the means (in terms of the second-stage units) 
and totals of the first-stage units are tabulated, the sums of squares being obtained 
by the “mean × total” rule, i.e. every mean is multiplied by the corresponding 
total. 

If the second method is used n” can be replaced by n̄” in the expression 
n” s’² = V + n” U given above for the mean square for first-stage units of 
a random sample, or better by n₀” where 


n₀” = {S (n”) − S (n”²)/S (n”)}/(n’ − 1)          (8.11.d) 


With a stratified sample a value for n₀” is calculated for each stratum and a 
weighted mean taken, weighting by the degrees of freedom contributed to the 
within-strata sum of squares (Cochran, 1939, A). 
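The alternative multipliers can be compared with a short Python sketch. The expression used for n₀” is the standard one for unequal group sizes and is assumed to be what formula 8.11.d states; the counts of second-stage units are hypothetical.

```python
def effective_group_size(counts):
    """n0 = {S(n) - S(n^2)/S(n)} / (number of groups - 1)."""
    total = sum(counts)
    return (total - sum(c * c for c in counts) / total) / (len(counts) - 1)


def harmonic_mean(counts):
    """Reciprocal of the mean of the reciprocals."""
    return len(counts) / sum(1 / c for c in counts)


equal_counts = [3, 3, 3, 3]
unequal_counts = [1, 2, 3]

print(effective_group_size(equal_counts))
print(round(effective_group_size(unequal_counts), 4),
      round(harmonic_mean(unequal_counts), 4),
      sum(unequal_counts) / len(unequal_counts))
```

With equal numbers all three multipliers coincide; with unequal numbers both n₀” and the harmonic mean fall below the arithmetic mean.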

These alternative methods of analysis are not exactly equivalent, but we 
cannot discuss their differences here, beyond stating that the first method is 
generally best when all the first-stage units are of approximately the same 
size and the variation in the numbers of second-stage units per first-stage unit 
is due to extraneous causes, whereas the second method is likely to be preferable 
when the first-stage units vary greatly in size and the number of second-stage 
units per first-stage unit is about proportional to this size. 

The above methods can easily be extended to multi-stage sampling with 
more than two stages. 


Example 8.11.a 


Calculate the expected sampling errors of the wheat acreages derived from 
the two-stage sample B₁ of Hertfordshire farms of Table 3.11.b, and discuss 
the effects of varying the number of parishes in the sample, with adjustment 
of the second-stage sampling fraction so as to give the same total number 
of farms in the sample. 


Part A of the variance has already been determined in Example 8.9.c. 
We have A = 8.263 × 10⁶. 

The determination of part B requires the evaluation of the variance of the 
r for individual parishes due to the second-stage sampling of these parishes. 
These variances were evaluated separately for each of the 17 parishes of the 
sample, using method (a) of Section 7.9. The mean value of these variances 
V” (r) was found to be 0.003575. 


The equation of estimation of the total acreage is Y = Σ X_i r̄_i. The 
second-stage variance of r̄_i is V” (r)/n_i, and part B of the variance is therefore 
given by 


B = V” (r) Σ X_i²/n_i = 0.003575 × 45.429 × 10⁸ = 16.24 × 10⁶ 

Hence V (Y) = 24.50 × 10⁶. 


Exact treatment of the effects of varying the number of parishes is complicated 
by the fact that the first-stage sampling fractions are bound to vary somewhat 
from district to district, and that the number of farms per parish is also variable. 
Ignoring these sources of disturbance we may write for any number n’ of 
parishes and n” of farms per parish 


V (Y) = (1 − f’) α/n’ + (1 − f”) β/(n’ n”) 


The values of α and β can be determined from the values of A and B. 
The mean number of farms per parish is N” = 2496/91 = 27.429. Putting 
n’ = 17, f’ = 17/91, f” = ¼, and n” = ¼ × 27.429 = 6.857, we have 


α = 17/(1 − 17/91) × 8.263 × 10⁶ = 172.7 × 10⁶ 
β = 17 × 6.857/(1 − ¼) × 16.24 × 10⁶ = 2524.2 × 10⁶ 


The effect of any variation in n’ and n” can now be determined from the 
formula for V (Y). If the total number of farms n (= n’ n”) is to be kept fixed, 
the formula is best rewritten in the form 

V (Y) = (α − β/N”)/n’ + β/n − α/N’ 
      = {80.71/n’ + 2524.2/n − 1.8982} × 10⁶ 

with the checks that, when n’ = 91 and n = 2496, V (Y) is zero, and when 
n’ = 17 and n = 17 × 6.857 = 116.57, it equals the value given above. 

The values of V (Y)/10⁶ for 5, 10, 20, and 30 parishes and a number of 
farms, 116.57, approximately the same as that of the actual sample, are 35.9, 
27.8, 23.8 and 22.4 respectively. If all 91 parishes were sampled the 
corresponding variance would be 20.6. There is thus no great gain in taking 
more than 20 parishes when sampling within parishes is with a uniform 
sampling fraction. 

It should be noted that the use of the above formula when the fraction 
of parishes sampled is large is unrealistic, in that sampling with probability 
proportional to size could not be adopted in such cases. It serves, however, 
to illustrate the use of the similar formulae which could be developed for 
sampling with uniform probability at the first stage. 
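The tabulated variances of this example can be reproduced by the Python sketch below. It assumes the form V (Y) = (1 − f’) α/n’ + (1 − f”) β/(n’ n”), with α and β fixed so that the two terms equal parts A and B for the actual sample of 17 parishes and a second-stage fraction of one quarter. All variances are in units of 10⁶, and the names are illustrative.

```python
N_PARISHES = 91
N_FARMS = 2496
PART_A = 8.263    # first-stage part, units of 10^6
PART_B = 16.24    # second-stage part, units of 10^6

farms_per_parish = N_FARMS / N_PARISHES           # N'' = 27.429
n1_actual = 17
f1_actual = n1_actual / N_PARISHES
f2_actual = 0.25
n2_actual = f2_actual * farms_per_parish          # 6.857 farms per parish

alpha = n1_actual * PART_A / (1 - f1_actual)
beta = n1_actual * n2_actual * PART_B / (1 - f2_actual)


def expected_variance(n_parishes, n_farms):
    """V(Y)/10^6 for a given number of parishes and total number of farms."""
    return ((alpha - beta / farms_per_parish) / n_parishes
            + beta / n_farms
            - alpha / N_PARISHES)


n_total = n1_actual * n2_actual                   # 116.57 farms
for n_parishes in (5, 10, 20, 30, 91):
    print(n_parishes, round(expected_variance(n_parishes, n_total), 1))
```

The two checks mentioned in the text hold by construction: complete enumeration gives zero, and the actual sample gives A + B.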


Example 8.11.b 


Repeat the analysis of Example 8.11.a for the sample B₂ of Table 3.11.b. 


Part A of the total variance will be the same as in sample B₁. 

Part B can be calculated in the same manner as in Example 8.11.a, using 
the method of Section 7.10, with a separate value of the ratio for each parish. 
This gives V” (r) = 0.0008167, and B = 3.710 × 10⁶. Hence A + B 
= 11.97 × 10⁶. 

The expression of V (Y) in terms 
sampling fraction at the second stage. 
an average sampling fraction f” for t
a formula of the same form as in Example 8.11.a. This gives f” = 0.29121 
and we then find 

V (Y) = {146.83/n’ + 710.76/n − 1.8982} × 10⁶ 
with checks as before. 

This formula gives values of V (Y)/10⁶ for 5, 10, 20 and 30 parishes and 
135.8 farms of 32.7, 18.0, 10.7 and 8.2 respectively. With more accurate 
sampling of the farms within the selected parishes, therefore, there is a more 
marked decrease in variance as the number of parishes is increased. The above 
values underestimate the decrease, as with the reduction in the number of farms 
per parish the change in the second-stage sampling fractions will result in a 
somewhat smaller increase in the variance than that given by the formula. 
More accurate values could be obtained by recalculating the second-stage 
variances with various intensities of second-stage sampling, using graphical 
methods for interpolation between the calculated values. 


Example 8.11.c 


Investigate the relative precision of the determination of the acreages of 
crops by the measurement of the areas of fields and part fields included in a 
sample of rectangular areas, and the use of grids of points covering these areas 
(Section 4.24). 


If a sample area has an area a and the proportion of the area occupied by 
a given crop is p, the area y occupied by this crop in this sample area equals ap. 
If a random set of n points is taken over the area the variance of the estimate of y 
given by the proportion of points falling in the given crop is 

V (y) = pqa²/n 

where q = 1 − p. This variance will be additional to the sampling variance of 
the y over the sample areas. This latter variance depends on the sampling 
method and the variability of the y from area to area, and can only be 
determined from actual sample data. 

As an example we may consider the case in which the areas are randomly 
selected, and the frequency distribution of the areas with proportions 0.0, 
0.1, . . . of the given crop is as follows: 

Proportion, p   0.0   0.1   0.2   0.3   0.4   0.5   0.6   0.7   0.8   0.9   1.0 

Frequency, φ    0.05  0.15  0.20  0.15  0.12  0.10  0.07  0.05  0.03  0.03  0.05 

The average variance due to the point sampling is given by 

Σ φ V (y) = (a²/n) Σ φ pq = 0.1656 a²/n 

The sampling variance of the y is given by 

V (y) = Σ φ p² a² − (Σ φ pa)² = 0.069024 a² 

The proportional increase in variance due to point sampling is therefore 

0.1656/0.069024 n = 2.40/n 

With 9 points there is an increase of 27 per cent. in the variance and with 
16 points an increase of 15 per cent. In the latter case about one-seventh 
more areas will be required for the same accuracy. Against this must be set 
the fact that only fields in which the points fall need be examined and recorded. 
The occurrence of mixed crops and systematic location of the points on a 
rectangular grid will also reduce the sampling variance. 
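The two variances of this example can be verified by the Python sketch below, which works with the coefficients of a²/n and a² so that the area a drops out. It assumes the frequency distribution as tabulated and purely random location of the points; names are illustrative.

```python
proportions = [i / 10 for i in range(11)]
frequencies = [0.05, 0.15, 0.20, 0.15, 0.12, 0.10, 0.07, 0.05, 0.03, 0.03, 0.05]

# Average point-sampling variance: coefficient of a^2 / n.
point_coefficient = sum(f * p * (1 - p) for f, p in zip(frequencies, proportions))

# Variance of the true proportions between areas: coefficient of a^2.
mean_p = sum(f * p for f, p in zip(frequencies, proportions))
between_coefficient = sum(f * p * p for f, p in zip(frequencies, proportions)) - mean_p ** 2


def proportional_increase(n_points):
    """Increase in variance due to using n_points random points per area."""
    return point_coefficient / (between_coefficient * n_points)


print(round(point_coefficient, 4), round(between_coefficient, 6))
print(round(100 * proportional_increase(9)), round(100 * proportional_increase(16)))
```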


8.12 An example of a pilot sampling scheme for crop estimation 


In order to investigate the practicability of obtaining estimates of the yields 
of cereal crops in the United Kingdom by the harvesting of sample areas, the 
yields of a number of wheat fields were determined by this method in each of 
the years 1934-1938 (Cochran, 1939, A). Fields were taken in several districts 
each year, one or two fields being selected at random from the fields growing 
wheat on each chosen farm. The selection of farms in each district was not 
random, the farms being taken in the neighbourhood of the centres at which 
the investigators were located. 

The sampling of the individual fields followed the lines described in 
Section 4.29, the fields being traversed in the direction of the rows, along two 
lines selected at random. Two sets of unit areas were taken from each line. 
Each unit area consisted of ¼ metre of each of 6 contiguous rows. For the 
most part, sets each contained three unit areas, equally spaced along the line, 
with a random starting point. 

The yields of grain obtained in 1937 are shown in Table 8.12.a. The 
mean yield of all the unit areas in each set is given. In order to allow for 
differences in row spacing on the different fields the yields have been reduced 
to a 6-inch row spacing, and therefore represent the yields in grams of areas 
of ¼ metre × 3 ft. Fields on the same farm are indicated by brackets. In 
District III, where three fields from a single farm were sampled, each field 
was growing two varieties which were sampled separately. 

The analysis of variance was carried out in units of the totals of the four 
sets, i.e. on yields of areas of 1 metre × 3 ft. or 0.000226 acres. Thus the 
sum of squares of the sets is multiplied by 4, and the sum of squares of the 
line totals by 2. 

The sums of squares for 1937 can be obtained from Table 8.12.a by 
calculating the sum of squares for each classification, disregarding the others, 
and deducting the sum of squares corresponding to the next higher classification. 
The rule of “mean × total” or “total²/(number of units)” is followed in 
each case. Thus the correction for the mean is 11,706²/39 = 3,513,601. The 
sum of squares for districts is 


¼ (1098)² + ⅓ (909)² + . . . − 3,513,601 = 54,224 

The sum of squares for farms is 


½ (422)² + ½ (676)² + ½ (619)² + 290² + ⅙ (2053)² + . . . 
− 3,513,601 − 54,224 = 132,062 

The arithmetical work can be simplified by omitting items which are 
repeated in more than one sum of squares. In particular the sum of squares 
for varieties is 


364² + 405² + 287² + . . . − ½ (769)² − ½ (597)² − ½ (687)² = 1326 


TABLE 8.12.a—SAMPLING OF WHEAT FIELDS, 1937: MEAN YIELDS OF GRAIN 
PER UNIT AREA (0.0000565 ACRES) IN GRAMS 


                      District I          District II            District III 
1st   Set 1      47   48   75  105     93   58   76     92   89   89   75   70   80 
line  Set 2      63   51   71   82     84   78   57     83  111   58   72   85   97 
2nd   Set 3      67   45   75   97     75   68   79     93   90   70   82   76  111 
line  Set 4      55   46   85   86     80   83   78     96  115   70   81  102   66 
                232  190  306  370    332  287  290    364  405  287  310  333  354 
                   422       676         619               769       597       687 
Totals               1098                909                      2053 

                                       District IV                          District V 
1st   Set 1      29   45   57   69   78   59   68   97   60   65   81      77   60 
line  Set 2      21   39   63   55  109   59   56   88   53   59   94      74   88 
2nd   Set 3      29   69   46   21   90   58   74  109   43   49   93      44   84 
line  Set 4      31   57   66   40   51   53   61   95   48   71   92      57   97 
                110  210  232  185  328  229  259  389  204  244  360     252  329 
                   320       417                                             581 
Totals                                  2750                                 581 

                                            District VI 
1st   Set 1      66   93   55  127   84   80   81   93   21   84   87   79   90 
line  Set 2      73   70   56  106   80   86  107  106   63   51   67   79  117 
2nd   Set 3      64   80   83   84   63   88  135   71   50   82  135   71  112 
line  Set 4      73   67   60   98   89  110   82   83   29   80  114   89  122 
                276  310  254  415  316  364  405  353  163  297  403  318  441 
                   586       669       680 
Total                                       4315 


Furthermore the sum of squares corresponding to the difference between 
any two totals containing the same number of units can be obtained by squaring 
the difference and dividing by twice the number of units in either total. Thus 
the sum of squares due to varieties is also given by ½ (41² + 23² + 21²). 
This gives a useful check in cases in which, as here, many of the sums of 
squares depend on differences of pairs of values. Thus the sum of squares 
between sets is given by 2 (16² + 12² + 3² + 1² + . . .) = 52,368, that 
between lines by 12² + 8² + . . . = 33,924. 

The sum of squares between fields within farms (excluding District III) 
is given by ½ (42² + 64² + 45² + 100² + . . .) = 27,702. The sum of squares 
between fields within farms for District III has to be calculated in the ordinary 
manner, since there are three fields, and is ½ (769)² + ½ (597)² + ½ (687)² 
− ⅙ (2053)² = 7,401, giving a corresponding total sum of squares of 35,103. 

Those not fully familiar with the analysis of variance technique should 
recalculate the sums of squares of this example in the various alternative ways 
indicated. 
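For those who wish to check the arithmetic mechanically, the Python sketch below applies the rule of total²/(number of units) to the 39 totals of four sets. The grouping into farms is that indicated by the farm totals of Table 8.12.a, and the second District VI total is taken as 310, the sum of its four set means; both are readings of the table rather than additional data. Names are illustrative.

```python
# Totals of the four set means for each sampled field (or field-variety), 1937,
# grouped by farm within district.
farms_by_district = {
    "I": [[232, 190], [306, 370]],
    "II": [[332, 287], [290]],
    "III": [[364, 405, 287, 310, 333, 354]],
    "IV": [[110, 210], [232, 185], [328], [229], [259], [389], [204], [244], [360]],
    "V": [[252, 329]],
    "VI": [[276, 310], [254, 415], [316, 364], [405], [353], [163], [297],
           [403], [318], [441]],
}


def total_squared_over_count(values):
    """The 'mean x total' rule: (sum of values)^2 / number of values."""
    return sum(values) ** 2 / len(values)


all_units = [u for farms in farms_by_district.values() for farm in farms for u in farm]
correction = total_squared_over_count(all_units)

ss_districts = sum(
    total_squared_over_count([u for farm in farms for u in farm])
    for farms in farms_by_district.values()
) - correction

ss_farms = sum(
    total_squared_over_count(farm)
    for farms in farms_by_district.values() for farm in farms
) - correction - ss_districts

# Fields within farms, excluding District III: half the squared difference of each pair.
two_field_farms = [
    farm for district, farms in farms_by_district.items() if district != "III"
    for farm in farms if len(farm) == 2
]
ss_fields_pairs = sum((a - b) ** 2 / 2 for a, b in two_field_farms)

print(round(correction), round(ss_districts), round(ss_farms), ss_fields_pairs)
```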

For general purposes it is best to convert the mean squares into some 
common units such as (cwt. per acre)². The conversion factor is here 0.0075861. 
This is done in Table 8.12.b, which also shows the results of the similar 
analyses for the other four years of the investigation. 
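The conversion can be sketched in Python as follows. The factor is derived on the assumption of a hundredweight of 112 lb and a unit of 0.000226 acres; the sums of squares and degrees of freedom are those found for 1937, and the quoted factor is used for the mean squares. Names are illustrative.

```python
GRAMS_PER_CWT = 112 * 453.59237   # imperial hundredweight of 112 lb (assumed)
UNIT_AREA_ACRES = 0.000226        # area represented by the total of four sets

# Grams per unit -> cwt. per acre, squared because mean squares are converted.
derived_factor = (1 / (UNIT_AREA_ACRES * GRAMS_PER_CWT)) ** 2
QUOTED_FACTOR = 0.0075861

# (sum of squares, degrees of freedom) for 1937, in (grams per unit)^2.
sums_of_squares_1937 = {
    "districts": (54224, 5),
    "farms": (132062, 19),
    "fields": (35103, 11),
    "lines": (33924, 39),
    "sets": (52368, 78),
}

mean_squares = {
    name: QUOTED_FACTOR * ss / df for name, (ss, df) in sums_of_squares_1937.items()
}

print(round(derived_factor, 7))
for name, value in mean_squares.items():
    print(name, round(value, 2))
```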


TABLE 8.12.b—ANALYSIS OF VARIANCE PER FIELD OF YIELDS OF WHEAT GRAIN 
(CWT. PER ACRE) 


                               1934          1935          1936          1937          1938 
                            d.f.  m.s.    d.f.  m.s.    d.f.  m.s.    d.f.  m.s.    d.f.  m.s. 

Between districts             4   66.5      6  318.4      4   79.4      5   82.3      4  206.8 
Within districts between 
  farms                      11   38.9     12   27.1      7   62.2     19   52.7     14   65.3 
Within farms between 
  fields                      —    —       15   22.8      8   31.2     11   24.2 
Within fields: 
  Sampling error             16   5.33     40   6.20     22  11.39     39   6.60 
  Between sets               32   2.11     80   2.18     45   2.52     78   5.09 
Mean yield                        29.1          23.3          24.3          26.2 


The mean squares for the same component of variance in the different 
years are not estimates of precisely the same quantities, owing to variation 
in the numbers of fields per farm, etc. In view of the small number of degrees 
of freedom in each year, however, we shall not lose much information by 
pooling all the years, weighting the mean squares in proportion to the degrees 
of freedom. This pooled estimate is shown in Table 8.12.c. 

From the degrees of freedom we may deduce that there are 91 farms and 
133 fields in all in the sample, i.e. a mean of 1.46 fields per farm. Denoting 
the variance per set by V₁, the additional components of variance per line, 
per field and per farm by V₂, U₁ and U₂ respectively, and ignoring the fact 
that a few fields have more than two lines and that the number of fields per 
farm is variable, we have the mean square equivalences shown in Table 8.12.c. 
Hence V₁ = 14.0, V₂ = 8.4, U₁ = 15.0, U₂ = 18.2. 


TABLE 8.12.c—COMBINED ANALYSIS OF VARIANCE 


                               Degrees 
                               of           Mean 
                               freedom      square      Estimate 
Between districts                 23        162.3 
Within districts between 
  farms                           63         49.3       ¼ V₁ + ½ V₂ + U₁ + 1.46 U₂ 
Within farms between 
  fields                          42         22.7       ¼ V₁ + ½ V₂ + U₁ 
Within fields between 
  lines                          145         7.69       ¼ V₁ + ½ V₂ 
Within lines between 
  sets                           290         3.50       ¼ V₁ 
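The components of variance follow from the pooled mean squares by successive subtraction, as the Python sketch below illustrates. It assumes the equivalences ¼ V₁, ¼ V₁ + ½ V₂, ¼ V₁ + ½ V₂ + U₁ and ¼ V₁ + ½ V₂ + U₁ + 1.46 U₂ for sets, lines, fields and farms respectively; names are illustrative.

```python
# Pooled mean squares in (cwt. per acre)^2.
pooled = {"farms": 49.3, "fields": 22.7, "lines": 7.69, "sets": 3.50}
FIELDS_PER_FARM = 133 / 91   # mean number of fields per farm, about 1.46

v1 = 4 * pooled["sets"]                                   # variance per set
v2 = 2 * (pooled["lines"] - pooled["sets"])               # additional, per line
u1 = pooled["fields"] - pooled["lines"]                   # additional, per field
u2 = (pooled["farms"] - pooled["fields"]) / FIELDS_PER_FARM  # additional, per farm

print(round(v1, 1), round(v2, 1), round(u1, 1), round(u2, 1))
```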


The value of U₂ is an underestimate, firstly because we have used the mean 
number of fields per farm, instead of calculating the correct value of n₀” from 
formula 8.11.d, and secondly because of the fact that the sample of farms 
was not random. For 1937 the value of n₀” is 1.29, compared with the 
value of n̄” of 1.44. 

From the above estimates of the different components of variance we may 
calculate the variance to be expected with a sample of any given type and 
size. If the unweighted mean of the yields per acre of the different fields 
can be taken as the estimate of the mean yield per acre over the country, i.e. 
if the potential bias due to association of yield per acre with size of field, etc., 
can be ignored, the variance of the mean yield per acre with a fixed amount 
of sampling of individual fields and with equal numbers of fields taken from 
all selected farms will depend solely on the number of farms and the number 
of fields in the sample. If these are n₂ and n₁ respectively the variance of the 
mean yield per acre with the same amount of sampling per field as that actually 
adopted will be 

U₁’/n₁ + U₂/n₂ 


where U₁’ = ¼ V₁ + ½ V₂ + U₁ = 22.7. Thus with 200 fields from 100 farms, 
2 fields per farm, the variance with the above values of the components of 
variance is 0.296. This is equivalent to a standard error of 0.54 cwt. per acre, 
or 2.0 per cent. of the mean yield. 

This is an over-simplification of the practical situation. In general the 
possibility of bias cannot be ignored, and a properly weighted mean must 
therefore be taken. Any statement in general terms would be difficult, since 
the weighting depends on the variation in numbers and acreages of the fields 
on the individual farms and the sampling method adopted. Given the numbers 
and acreages of the fields on an adequate sample of farms, however, the weighting 
coefficients for any chosen method of sampling can be determined. If these 
are denoted by w, and if [w] denotes the sum of the weights for all the sampled 
fields on a farm, the variance of the weighted mean will be 


{U₁’ S (w²) + U₂ S ([w]²)}/{S (w)}² 


Thus the relative precision of alternative methods of sampling can be evaluated 
without difficulty. 
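A minimal Python sketch of the weighted form is given below. It assumes the expression {U₁’ S (w²) + U₂ S ([w]²)}/{S (w)}², with U₁’ = 22.7 and U₂ = 18.2, and represents a sample as a list of farms, each a list of field weights. The weights shown are hypothetical.

```python
U1_PRIME = 22.7   # variance per field, including the sampling error within fields
U2 = 18.2         # additional component of variance per farm


def weighted_mean_variance(farm_weights, u1_prime=U1_PRIME, u2=U2):
    """Variance of a weighted mean of field yields.

    farm_weights: one list of field weights for each sampled farm.
    """
    sum_w = sum(sum(fields) for fields in farm_weights)
    sum_w_squared = sum(w * w for fields in farm_weights for w in fields)
    sum_farm_squared = sum(sum(fields) ** 2 for fields in farm_weights)
    return (u1_prime * sum_w_squared + u2 * sum_farm_squared) / sum_w ** 2


# Equal weights, 100 farms with 2 fields each: reduces to U1'/200 + U2/100.
equal_sample = [[1.0, 1.0] for _ in range(100)]
equal_variance = weighted_mean_variance(equal_sample)

# Hypothetical unequal weights, e.g. proportional to field acreage.
unequal_sample = [[1.0, 3.0] for _ in range(100)]
unequal_variance = weighted_mean_variance(unequal_sample)

print(round(equal_variance, 4), round(equal_variance ** 0.5, 2))
print(round(unequal_variance, 4))
```

With equal weights the expression reduces to the unweighted result for 200 fields from 100 farms; unequal weights within farms increase the first term only.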

The above procedure is approximate in another respect which is not entirely 
irrelevant to the practical situation. It has been assumed that the component 
of variance from field to field on the same farm, and that between farms, are 
independent of the size of the farm. This is not likely to be strictly true, and 
may introduce appreciable inaccuracy when a variable sampling fraction is 
used for farms of different sizes. 

It will be noted that no corrections for sampling from a finite population 
are necessary, provided the fraction of the farms in the sample is small. The 
second-stage sampling fraction of fields from farms may be large, but this 
fraction does not enter into the formula for the partition of the variance given 
by the analysis of variance, as for example is shown by formula 8.11.c. 

Although the variance of the unbiased estimate depends on the acreages, 
and therefore cannot be easily formulated in general terms, certain general 
statements about the relative precision of different types of sample can be made 
from the above results. 

In the first place we may consider the effect of varying the amount of 
sampling of the selected fields. If the number of sets per line is reduced to 
one, for example, the sampling variance per field will be } V, + 3 Va = 11-2, 
instead of 7-7. If at the same time the number of lines is increased to four, 
the sampling variance per field will be $V, + ł Va = 5-6. 

There is little to be gained by increase in the accuracy of the determination 
of the yields of individual fields, however. With one field per farm the
effective variance per field if the yields are determined without error will be
$U_1 + U_2 = 33.2$, instead of 40.9. Consequently with the intensity of sampling
actually adopted the relative accuracy is 0.81. Doubling the number of lines
per field with two sets per line would only increase the accuracy by 10 per cent.
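
The arithmetic of the last two paragraphs can be checked with a short sketch. The component values 8.4 and 14.0 used below are not quoted in this section; they are back-calculated from the figures 7.7 and 11.2, on the assumption that the scheme actually adopted took two lines per field and two sets per line, and that $V_1$ and $V_2$ are the components between lines and between sets within lines.

```python
# Assumed components of the sampling variance per field
# (back-calculated from the quoted figures, not given in the text).
V1 = 8.4       # between lines within fields (assumption)
V2 = 14.0      # between sets within lines (assumption)
U_TRUE = 33.2  # U1 + U2: variance per field if yields were free of sampling error


def sampling_variance(lines: int, sets_per_line: int) -> float:
    """Sampling variance of a field mean: V1/lines + V2/(lines * sets_per_line)."""
    return V1 / lines + V2 / (lines * sets_per_line)


def relative_accuracy(lines: int, sets_per_line: int) -> float:
    """Error-free variance per field divided by the effective variance per field."""
    return U_TRUE / (U_TRUE + sampling_variance(lines, sets_per_line))


if __name__ == "__main__":
    for lines, sets in [(2, 2), (2, 1), (4, 1), (4, 2)]:
        print(lines, sets,
              round(sampling_variance(lines, sets), 2),
              round(relative_accuracy(lines, sets), 3))
```

Under these assumptions the sketch reproduces 7.7, 11.2 and 5.6 for the sampling variance, 0.81 for the relative accuracy, and a gain of about 10 per cent from doubling the number of lines.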

The question of whether to sample one or more fields per farm requires 
more consideration. For farms growing a given number of fields of wheat 
(greater than one) the variance if two fields are sampled will be $\tfrac12 U_1' + U_2$
= 29.6, whereas if one field is sampled the variance will be 40.9. The whole
question of the methods of sampling farms and fields is bound up with the
question of costs, and will be further considered in Section 8.17. 

The effect of varying the number of unit areas per set cannot be precisely 
determined from the above analysis. If the unit areas of each set were randomly 
and independently located, the variance of the set means would be inversely
proportional to the number of units per set, and would be determined from
the variance between sets within lines. With the even spacing of units within 
the set, however, we may expect the reduction in variance with increasing 
numbers of unit areas to be somewhat greater than in the case of random 
location. From the basic data giving the yields of the separate unit areas it 
would be possible to determine the variance per unit area within sets, and 
from this variance and the variance between sets within lines a fair idea of the 
departure from the random law could be obtained. 

As a first approximation, however, we may assume that the random law 
holds. In this case, taking two instead of three unit areas per set would multiply
the value of $V_2$ by 3/2, and would therefore raise the value of $U_1' + U_2$ from
40.9 to 42.6. It was in fact recognized after the first year’s work that there
was little to be gained from having more than a small number of unit areas 
per set, and the number, which was five in the first year, was then reduced 
to three. 


8.13 A special case of two-stage sampling 


The possibility of sampling from within strata with probability proportional 
to size at the first stage, and with second-stage sampling fractions so chosen 
that the overall sampling fraction is uniform, has already been mentioned in 
Section 3.10 and subsequently. 

This case is of considerable practical importance, and also provides a useful 
example of the application of the above methods to the more complicated types 
of two-stage sampling. 

From the results already given in Sections 7.16 and 7.17 we have 


$$V(Y) = \sum s_{ri}'^2 X_i^2 (1 - f_i')/n_i' + \sum f_i' X_i^2\, V''(\bar r_i)$$ (8.13)


where $V''(\bar r_i)$ is the estimated second-stage variance of $\bar r_i$.

We will consider the case in which the sampling at the second stage is 
random (or stratified with uniform sampling fraction) and the number of 
second-stage units is taken as the measure of size. If $n_i'$ first-stage units are
selected from the $i$th stratum the probability of selection of the $j$th unit will
be $n_i' N_{ij}/N_i$, where $N_{ij}$ is the number of second-stage units in the $j$th
first-stage unit, etc. The second-stage sampling fraction for this unit, if selected,
will be $f N_i/n_i' N_{ij}$, where $f$ is the uniform overall sampling fraction. The
number of second-stage units selected will be $f N_i/n_i'$. Thus the same number
of second-stage units will be selected from each of the selected first-stage
units in a given stratum. If the variance $\sigma_i''^2$ of $y$ per unit at the second stage
can be taken as constant for the whole of the $i$th stratum the estimated variance
of $r_{ij}$ ($= \bar y_{ij}$) will be

$$V''(r_{ij}) = s_i''^2 (1 - f_{ij}'') \Big/ \frac{f N_i}{n_i'}$$

which is constant for all the selected units of the $i$th stratum except for the
factor $(1 - f_{ij}'')$. Provided all the $f_{ij}''$ are moderately small it will be sufficient
to replace them by $f_i'' = f/f_i'$. We then have

$$V''(\bar r_i) = \frac{s_i''^2 (1 - f_i'')}{f N_i}$$

The $X_i$ of formula 8.13 will be replaced by $N_i$, and some form of pooling
can be adopted to estimate an average value $s_r'^2$ of the $s_{ri}'^2$. In cases in which the
$s_{ri}'^2$ are likely to vary markedly, weights corresponding to those given by the
first term should be used.

We then have

$$V(Y) = s_r'^2 \sum N_i^2 (1 - f_i')/n_i' + \sum f_i'\, s_i''^2 N_i (1 - f_i'')/f$$

Following the previous procedure this may be re-written as

$$V(Y) = s_{r0}'^2 \sum N_i^2 (1 - f_i')/n_i' + \sum s_i''^2 N_i (1 - f_i'')/f$$

where

$$s_{r0}'^2 = s_r'^2 - \frac{\sum s_i''^2 N_i (1 - f_i')(1 - f_i'')}{f \sum N_i^2 (1 - f_i')/n_i'}$$

If the $f_i'$ are approximately equal, as will usually be the case, and the $s_i''^2$
are the same for all strata, we have

$$V(Y) = s_{r0}'^2 (1 - f') \sum N_i^2/n_i' + N s''^2 (1 - f'')/f$$

$$s_{r0}'^2 = s_r'^2 - \frac{N s''^2 (1 - f'')}{f \sum N_i^2/n_i'}$$


The second term of V (Y) will be recognized as (1 — f ”)/(1 — f ) times 
the variance which would be obtained with single-stage sampling of the second- 
stage units with uniform sampling fraction and the first-stage units as strata. 
If f” is small, therefore, the first term gives the increase in variance due to 
the adoption of the two-stage process, as in the case of two-stage random 
sampling. 
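
The selection rules of this section can be illustrated by a small sketch. The stratum below (three first-stage units of 200, 300 and 500 second-stage units, two first-stage selections, overall fraction 0.05) is purely hypothetical; the function name is likewise an assumption made for illustration.

```python
def second_stage_plan(unit_sizes, n_first, f):
    """For one stratum return, for each first-stage unit, the tuple
    (probability of selection, second-stage fraction, second-stage units taken)."""
    N_i = sum(unit_sizes)                    # second-stage units in the stratum
    plan = []
    for N_ij in unit_sizes:
        p_select = n_first * N_ij / N_i      # n_i' N_ij / N_i
        f2 = f * N_i / (n_first * N_ij)      # f N_i / (n_i' N_ij)
        if p_select > 1 or f2 > 1:
            raise ValueError("design not feasible for a unit of this size")
        plan.append((p_select, f2, f2 * N_ij))
    return plan


if __name__ == "__main__":
    for row in second_stage_plan([200, 300, 500], n_first=2, f=0.05):
        print(row)
```

In this hypothetical stratum every unit, if selected, contributes the same number of second-stage units, and the product of the two sampling fractions equals the overall fraction for every unit.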
8.14 Effect of change in size of the sampling units 


If the population is divided into N large units, each of which is subdivided 
into K small units, and if a sample of k small units from each of n large units 
is taken, an analysis of variance between and within large units can be made, 
and the components of variance U and V estimated as in Section 8.10. 

The estimate of the overall variance between small units will be given, 
as before, by V + U (N — 1)/N. That between large-unit means of k small 
units will be given by 1/k times the expectation of the mean square between 
large units, i.e. by V/k + U. The variance between large-unit means when
all K small units of each large unit are included will be V/K + U. 
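
As a minimal sketch of these relations, the components can be recovered from the two mean squares and the variances then written down directly. The mean squares and the values of k, K and N used in the example are hypothetical.

```python
def variance_components(ms_between, ms_within, k):
    """Estimate (U, V) from the analysis of variance of k small units per large unit.

    E(mean square between large units) = V + k U
    E(mean square within large units)  = V
    """
    V = ms_within
    U = (ms_between - ms_within) / k
    return U, V


def variance_small_units(U, V, N):
    """Overall variance between small units: V + U (N - 1) / N."""
    return V + U * (N - 1) / N


def variance_large_unit_means(U, V, m):
    """Variance between large-unit means based on m small units each: V/m + U."""
    return V / m + U


if __name__ == "__main__":
    U, V = variance_components(ms_between=30.0, ms_within=10.0, k=4)
    print(U, V)
    print(variance_small_units(U, V, N=100))
    print(variance_large_unit_means(U, V, m=20))  # all K = 20 small units included
```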

These results enable us to determine the effect on the sampling error of the 
alternatives of using the large or the small units as sampling units. It will
be noted that for this determination it is not necessary to have data in which
all the small units that go to make up the selected large units are observed. 
The analogy with two-stage sampling will be apparent.

Various extensions of these results are of interest. If the large units are
stratified, with $N_i$ units per stratum in the population and $n$ in the sample,
there will be a further between-strata component of variance $U_t$, and the mean
squares in the analysis of variance will provide estimates as follows:

Between strata .. .. .. .. .. .. ..         $V + kU + knU_t$
Within strata between large units .. ..     $V + kU$
Within large units between small units ..   $V$

The estimates of the different variances will then be:

Small units within strata .. .. ..          $V + U(N_i - 1)/N_i$
Large units (means) within strata .. ..     $V/K + U$
Large units (means) overall .. .. ..        $V/K + U + U_t(t - 1)/t$

The first two variances are the same as previously, except that $N$ is replaced
by $N_i$.


The above approach enables us to determine the effect of simultaneous 
change of size of unit and size of strata. This is relevant when the strata can 
be of any size, and the size is therefore chosen to contain two units (or one unit) 
per stratum. If the size of unit is halved and the amount of material in the 
sample remains the same, for example, there will be twice as many units, and 
the size of the strata can therefore be halved. We shall then require a four-fold 
analysis of variance into whole strata, half-strata, whole units, and half-units. 
The minimum amount of data required for this purpose will be two whole 
units (of which each half-unit is separately recorded) in each half-stratum. 

The expressions for mean squares and variances will be similar to those 
given above, $N_i$ being the number of whole units per half-stratum in the
population, and $t$ the number (two) of half-strata per stratum. These
expressions are as follows:

Mean squares:

Within whole strata between half-strata ..  $V + 2U + 4U_t$
Within half-strata between whole units ..   $V + 2U$
Within whole units between half-units ..    $V$

Variances:

Half-units within half-strata .. ..         $V + U(N_i - 1)/N_i$
Whole units within half-strata .. ..        $\tfrac12 V + U$
Whole units within whole strata .. ..       $\tfrac12 V + U + \tfrac12 U_t$


Since there will be half as many whole units as half-units the relative
precision of the two methods of sampling will be given by

$$\frac{V + 2U + U_t}{V + U(N_i - 1)/N_i}$$


8.15 Variation in size of strata 


When the strata boundaries are arbitrary, the size of the strata may be 
varied in such a manner that a fixed number of units require to be selected
from each stratum, whatever the size of the sample. The strata will naturally
be taken as small as possible, i.e. so as to contain two units if a rigorous estimate 
of error is required, or one unit otherwise. 

In order to determine the size of sample required for a given accuracy 
under these conditions it is necessary to know the relation between the size 
of the strata and the within-strata variance. The simplest way in which this 
relation can be determined is to obtain data for all units of a representative 
sample of the largest strata that are of interest. Strata of any smaller size 
can then be constructed, and the within-strata variances calculated. 

A minor difficulty in this construction is that the original strata will only 
be exactly subdivisible into strata of smaller size if these contain numbers of 
units which are integral fractions of the numbers in the original strata. If in 
area sampling the strata are also to be of the same shape, only squares of 
integral fractions, i.e. 1/4, 1/9, . . . , will give exact subdivision. For strata of
intermediate size there will therefore be a certain amount of arbitrariness in the 
location of the strata boundaries. Some objective rule must therefore be 
followed. If it appears desirable, overlapping strata may be used. Thus in a 
case in which the data cover a set of isolated squares, four sets of smaller squares 
may be taken within each large square, each set having a corner point coincident 
with one corner of the large square. 

If the smallest strata likely to be of interest each contain a large number of 
units, the collection of data in full for all the units of these basic strata is likely 
to be laborious. Instead a random sample of such units may be taken. In 
this case the within-strata variances can be estimated by means of an analysis 
of variance similar to that used for change in size of sampling units (Section 8.14).
If the small units of that section are taken as equivalent to the units of the 
present case, the large units as equivalent to the basic strata, and the strata 
as equivalent to the larger strata, the same expressions hold. 

This procedure has the disadvantage that variances can only be obtained 
for strata which contain an integral number of the basic strata—if the strata 
are all to be of the same shape the number must be a square. This disadvantage 
can be overcome by sub-stratifying the basic strata, with random selection 
of units from within these sub-strata. Thus in area sampling with square 
strata, if each basic stratum is subdivided into nine square sub-strata, with 
a minimum of two selected units per sub-stratum, square strata can be constructed
with areas of 1, 1 7/9, 2 7/9, 4, 5 4/9, 7 1/9, 9, . . . times the area of the basic
stratum. Separate analyses of variance will be required for the different sizes 
of strata, but these have certain elements in common. 

When the variances have been calculated for certain sizes of strata an 
approximate variance-size relationship can be constructed by graphical means. 
It is often advantageous to plot the log-variance against the log-size. A straight 
line on this graph represents a variance law of the type $\sigma_z^2 = a z^b$, where $z$ is
the number of units per stratum and a and b are constants. 
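
The graphical procedure can be imitated numerically. The sketch below fits the straight line on the log-log graph by ordinary least squares and then solves for the stratum size that gives a required variance of the mean, using the fact that a sample of 2N/z units (two per stratum) has variance of the mean equal to the within-strata variance divided by the number of units. The function names and the numerical values are hypothetical, and the least-squares fit is merely one convenient substitute for fitting by eye.

```python
import math


def fit_power_law(sizes, variances):
    """Fit variance = a * size**b by least squares on the logarithms."""
    xs = [math.log(z) for z in sizes]
    ys = [math.log(v) for v in variances]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    a = math.exp(my - b * mx)
    return a, b


def stratum_size(a, b, N, target_var, units_per_stratum=2):
    """Solve z * (a * z**b) = units_per_stratum * N * target_var for z (needs b > -1)."""
    return (units_per_stratum * N * target_var / a) ** (1.0 / (1.0 + b))


if __name__ == "__main__":
    sizes = [1, 2, 4, 8]
    variances = [2.0 * z ** 0.5 for z in sizes]   # exact power law, for illustration
    a, b = fit_power_law(sizes, variances)
    z = stratum_size(a, b, N=1000, target_var=0.01)
    print(a, b, z, 2 * 1000 / z)                  # last value: units in the sample
```

With one unit per stratum the same function applies with `units_per_stratum=1`.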

For the purpose of determining the size of sample required for a given 
accuracy it is better to plot $z\sigma_z^2$ against $z$. If $N$ is the total number of units
in the population the number of units in a sample with two units per stratum
will be $2N/z$, and we shall have $z\sigma_z^2 = 2N\,V(\bar y)$. Thus the required size
of the strata can be read off from the graph. With one unit per stratum the
factor 2 is omitted.

Fig. 8.15. RELATIONS BETWEEN COST AND ACCURACY IN SAMPLING FOR MEAN SOIL
TEMPERATURE OVER A PERIOD, WITH STRATA OF VARYING SIZE


The mean temperature is estimated from temperatures taken (a) on two days 
selected at random from each stratum (block of days), (b) on one day selected at 
random from each stratum, (c) on days equally spaced throughout the period. 


Reproduced by permission of the Royal Society (Yates, 1948, A). 


It has been pointed out in Section 3.14 that systematic sampling, when 
used on the type of material for which it is suitable, is likely to have an error 
variance which is somewhat less than random sampling with one unit per 
stratum. In neither type of sampling can the sampling error be estimated 
with any certainty from the results of a single sample. In random sampling
with one unit per stratum, however, an objective estimate of error is possible
if additional randomly located units are taken in certain of the strata, but 
in a systematic sample much more elaborate methods have to be used (Yates, 
1948, A), and even then the estimates obtained are not fully objective. 

It may be noted here that the common practice of estimating the error 
of a sample with one unit per stratum by combining the strata in pairs, will 
give an estimate of error which will generally be somewhat greater than the 
true error with strata of double the size and two units per stratum. 

An example of the relation between the accuracy of sampling with two 
units per stratum, sampling with one unit per stratum, and systematic sampling, 
is given in Fig. 8.15. The curves (full lines) are based on the variances 
found for daily soil temperatures at 1 foot depth, each daily reading constituting 
a sampling unit. The cost scale is proportional to the number of units, and 
the accuracy scale gives the accuracy of the sample estimate of the mean soil 
temperature over a period. The curves themselves are based on relations for 
oz? of the type given above. The curve of losses due to errors and the broken 
curves will be referred to in Section 8.18. 

The material is of the type in which the reduction in variance with reduction 
in size of strata may be expected to be considerable. This is brought out by 
the curves. The relative precisions of the three types of sampling are given 
by the intercepts of horizontal lines, which are in the ratio 1 : 1.75 : 4.24.
The relative efficiencies are given by the reciprocals of the intercepts of vertical
lines, which are in the ratio 1 : 1.36 : 2.22. This provides an illustration of
the marked difference between relative precision and relative efficiency when
reduction in the size of the strata results in a considerable reduction in variance
per unit.


8.16 Efficiency in terms of cost 


In the previous sections we have described how to determine the size of 
sample necessary to attain results of a given accuracy when various methods 
of sampling are used. We have also indicated how the relative efficiency (in 
terms of numbers of sampling units) of different sampling methods and 
variations in a given method may be judged. Minimization of the number of 
sampling units or amount of material included in the sample will not in general,
however, give maximum efficiency in terms of cost. To attain this the sampling 
method must be so chosen that the total cost of the survey is minimized. 

To minimize the total cost it is necessary to know the relative costs of the
different operations. Exact evaluation of these costs is usually troublesome,
and is only worth while if an extensive survey has to be undertaken, or
a series of surveys on similar material is contemplated. The matter is
complicated by the fact that for many purposes it is the marginal cost of an 
additional unit, rather than the average cost per unit that is required. Never- 
theless it is not difficult in the course of survey operations, or even in the 
course of a pilot survey, to obtain data which will serve to give rough estimates
of the main components of the costs. With the aid of such estimates the
efficiency of further surveys of similar material can often be substantially 
improved. 

When information on the costs of different types of operation is available 
it is possible to determine the values of the sampling fractions, etc., which 
for a given sampling method will give results of the required accuracy for the 
least cost. Such values may be termed the optimal values. It is also possible 
to determine which of two methods, each employed in the most efficient 
manner, will be the least costly. 

The determination of optimal values of the sampling fractions, etc., requires 
minimization of the cost function, and will be dealt with in the next section. 
The choice between different methods when the optimal values of the sampling 
fractions, etc., are known, or when there are no variants of this type, can be 
obtained directly from the results of the previous sections. 

Thus in the case in which there is the possibility of using supplementary
information, if $c_s$ represents the cost per unit of obtaining the supplementary
information, $c_0$ the marginal cost per unit when no supplementary information
is obtained (these costs being taken to include the marginal costs of abstraction
and computation), and $C_1$ represents the additional computational cost of
utilizing the supplementary information (which apart from the above marginal
cost per unit may be taken as broadly independent of the size of the sample),
the total cost of a sample of $n_s$ units with supplementary information, excluding
elements of cost which are fixed for both methods, will be

$$C_s = C_1 + n_s (c_0 + c_s)$$

and that for a sample of $n_0$ units without supplementary information will be

$$C_0 = n_0 c_0$$

Under conditions in which the error variance is inversely proportional
to the number of units in the sample, the two samples will be of equal accuracy
when the numbers of units are in inverse ratio to the relative precision of the
two methods with equal numbers. If the regression method of adjustment is
used, therefore, and the sample is random,

$$n_s/n_0 = 1 - \rho^2$$

where $\rho$ is the true correlation coefficient between the main and the
supplementary variates (estimate $r$).

Hence the use of supplementary information will be more efficient if

$$n_0 c_0 > C_1 + n_0 (1 - \rho^2)(c_0 + c_s)$$

i.e. if

$$n_0 (c_0 + c_s)\,\rho^2 > n_0 c_s + C_1$$

If the cost of adjustment $C_1$ can be ignored this inequality becomes

$$c_s/c_0 < \rho^2/(1 - \rho^2)$$

which is independent of $n_0$, and therefore of the accuracy required. Thus,
for example, under these conditions, if $\rho = \tfrac12$ the use of supplementary
information will be worth while if the cost of collection is less than one-third
the cost of taking an equal number of additional units.

With $\rho = \tfrac12$, however, the gain will not be marked unless the ratio of the
costs is considerably less than 1/3. With a ratio of 1/6 the total costs will be in the
ratio of 7 : 8 (minimum value, with zero value of the cost ratio, 3 : 4). With
higher values of $\rho$ the gains are more marked. With $\rho = \tfrac34$ the two methods
have equal cost when $c_s/c_0 = 9/7$. In this case, when the ratio has the values
1/2 and 1/4 the ratio of the total costs will be 21 : 32 and 35 : 64 respectively
(minimum value 7 : 16).
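
These ratios follow directly from the expressions for the two total costs when the fixed cost of adjustment is ignored, and may be verified by a short sketch in exact fractions. The function names are assumptions made for illustration.

```python
from fractions import Fraction


def cost_ratio(rho, cs_over_c0):
    """Total cost with supplementary information divided by total cost without,
    for equal accuracy, ignoring the fixed cost of adjustment:
    (1 - rho^2) (1 + c_s/c_0)."""
    return (1 - rho ** 2) * (1 + cs_over_c0)


def break_even_cost_ratio(rho):
    """Value of c_s/c_0 at which the two methods cost the same: rho^2 / (1 - rho^2)."""
    return rho ** 2 / (1 - rho ** 2)


if __name__ == "__main__":
    half, three_quarters = Fraction(1, 2), Fraction(3, 4)
    print(break_even_cost_ratio(half))                  # 1/3
    print(cost_ratio(half, Fraction(1, 6)))             # 7/8
    print(break_even_cost_ratio(three_quarters))        # 9/7
    print(cost_ratio(three_quarters, Fraction(1, 2)))   # 21/32
    print(cost_ratio(three_quarters, Fraction(1, 4)))   # 35/64
```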


8.17 Minimization of the cost function 


When the sampling fractions, etc., of a method of sampling are not fully 
determined by the accuracy required in the results, the optimal values can be 
determined by minimizing the cost function. 

The total cost can usually be expressed as a linear function of the numbers 
of sampling units $n_1, n_2, \ldots$ in the various strata, etc., at least to a first
approximation, using marginal costs. The simplest procedure is then to add
a multiple $K$ of this linear function to the expression in terms of $n_1, n_2, \ldots$
for the variance of the required estimate, and differentiate the resultant
expression with respect to $n_1, n_2, \ldots$ in turn. This minimizes the variance for
fixed cost, which is equivalent to minimizing the cost for fixed variance. The 
exact procedure will be apparent from the first of the cases treated below. 


(a) Variable sampling fraction 
If the marginal cost of taking an additional unit of the $i$th stratum is $c_i$,
the total cost $C$, omitting constant elements, is given by

$$C = \sum c_i n_i$$

Hence, since

$$V(Y) = \sum \sigma_i^2 (1 - f_i) N_i^2/n_i$$ (8.17.a)

the expression to be minimized is

$$V(Y) + K (\sum c_i n_i - C) = \sum \sigma_i^2 (1/n_i - 1/N_i) N_i^2 + K (\sum c_i n_i - C)$$

Differentiating with respect to the $n_i$ and equating to zero, we have the
$t$ equations ($i = 1, 2, \ldots, t$)

$$-\sigma_i^2 N_i^2/n_i^2 + K c_i = 0$$

Hence, since $n_i/N_i = f_i$,

$$\frac{f_1}{\sigma_1/\sqrt{c_1}} = \frac{f_2}{\sigma_2/\sqrt{c_2}} = \ldots = \frac{1}{\sqrt K}$$ (8.17.b)

Thus the optimal sampling fractions are proportional to $\sigma_i/\sqrt{c_i}$. This is an
extension of the formula already given in Section 3.5.

The actual values of the sampling fractions required to attain a given
accuracy can be obtained by substituting for the $f_i$ in equation 8.17.a and
solving for $K$.* This gives

$$\sqrt K \sum N_i \sigma_i \sqrt{c_i} = V(Y) + \sum N_i \sigma_i^2$$ (8.17.c)


* An example of the calculations will be found in Example 10.4.


If any of the $f_i$ are equal to unity the corresponding terms must be omitted
from both sides of the equation. This may require a trial solution. 
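
A minimal numerical sketch of this procedure follows. It obtains the multiplier from the condition that the required variance is attained and then the sampling fractions, which are proportional to the stratum standard deviation divided by the square root of the marginal cost. The two strata in the example are hypothetical, and the sketch simply stops if any fraction reaches unity instead of carrying out the trial solution mentioned above.

```python
import math


def optimal_fractions(N, sigma, c, target_var):
    """Return (sqrt K, [f_i]) giving variance target_var for the estimated total
    at least cost, when V(Y) = sum sigma_i^2 (1 - f_i) N_i^2 / n_i."""
    sqrt_K = ((target_var + sum(n * s * s for n, s in zip(N, sigma)))
              / sum(n * s * math.sqrt(ci) for n, s, ci in zip(N, sigma, c)))
    f = [s / (sqrt_K * math.sqrt(ci)) for s, ci in zip(sigma, c)]
    if any(fi >= 1 for fi in f):
        raise ValueError("a fraction reaches unity: omit that stratum and re-solve")
    return sqrt_K, f


def variance_of_total(N, sigma, f):
    """V(Y) = sum sigma_i^2 (1 - f_i) N_i / f_i, since n_i = f_i N_i."""
    return sum(s * s * (1 - fi) * n / fi for n, s, fi in zip(N, sigma, f))


if __name__ == "__main__":
    N, sigma, c = [1000, 500], [2.0, 5.0], [1.0, 4.0]   # hypothetical strata
    sqrt_K, f = optimal_fractions(N, sigma, c, target_var=53500.0)
    print(sqrt_K, f, variance_of_total(N, sigma, f))
```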


(b) Two-phase sampling 

If $c_1$ represents the cost per unit of obtaining the first-phase information,
$c_2$ the additional cost per unit of obtaining the second-phase information, and
there are $n_1$ first-phase units, of which $n_2$ are included in the second phase,
the total cost, apart from constant elements, will be given by

$$C = n_1 c_1 + n_2 c_2$$

When the methods of sampling and estimation are such that the effective
variances of the estimates at each phase (apart from corrections for finite
sampling) are inversely proportional to the numbers of units, from the results
of Section 8.7 we have

$$V(\bar y) = \frac{\sigma_e^2}{n_2} + \frac{\sigma_y^2 - \sigma_e^2}{n_1}$$ (8.17.c)

Following the above procedure, we find

$$\frac{n_2^2}{n_1^2} = \frac{c_1}{c_2}\,\frac{\sigma_e^2}{\sigma_y^2 - \sigma_e^2} = \frac{c_1}{c_2}\,\frac{\kappa^2}{1 - \kappa^2}$$ (8.17.d)

where $\kappa = \sigma_e/\sigma_y$. The values of $n_1$ and $n_2$ required for a given accuracy can
be obtained by substituting for $n_2$ in terms of $n_1$ in $V(\bar y)$.


(c) Two-phase point sampling 

We will only consider the special case arising in crop estimation (Examples 
6.16.b and 7.15). 

If the acreages are determined from $n_0'$ points and the yields per acre are
determined on fields of the crop in question in which $n$ of these points fall,
and if $c'$ is the cost of visiting the field to determine the nature of the crop,
and $c$ the additional cost of a yield determination, we have, when a proportion
$p$ of the area is under the crop and the mean yield per acre is $\bar r$,

$$C = c' n_0' + c\,n$$

$$\frac{V(Y)}{Y^2} = \frac{q}{p\,n_0'} + \frac{V(r)}{n\,\bar r^2}$$ (8.17.e)

Hence

$$\frac{n^2}{n_0'^2} = \frac{p\,c'\,V(r)}{q\,c\,\bar r^2}$$ (8.17.f)


The values of $n_0'$ and $n$ are best obtained by substitution for $n$ in terms of
$n_0'$ in the equation for $V(Y)$.


(d) Two-stage sampling 
If $c'$ is the cost per first-stage unit, and $c''$ the additional cost per second-stage
unit, the total cost is given by

$$C = n' c' + n\,c''$$

With a random or stratified random sample with uniform sampling fraction
and equal numbers of second-stage units per first-stage unit, V (Y) is given by
formula 8.11.c. Thus

$$V(\bar y) = U/n' + V/n + \text{const.},$$

where

$$U = \sigma'^2 - \sigma''^2/N'', \quad \text{and} \quad V = \sigma''^2.$$

Following the previous procedure, we find

$$n^2/n'^2 = n''^2 = V c'/U c''$$ (8.17.g)


In other words the number of second-stage units per first-stage unit is
independent of the accuracy required. The values of $n'$ and $n$ required for
any given accuracy can be obtained by substitution in the equation for $V(\bar y)$.

The same formulae hold for any form of two-stage sampling in which $V(\bar y)$
can be written in the above form. Thus stratification with a variable sampling 
fraction at the second stage is covered, provided none of the sampling fractions 
are unity. 


(e) Two-stage sampling with probability proportional to size at the first stage 


The solution of the case of sampling from within strata with probability 
proportional to size of unit at the first stage follows similar lines. We find 
that if the cost per second-stage unit is the same for all first-stage units in all 
strata, one condition for minimum cost is that the second-stage sampling 
fractions are so chosen that the overall sampling fraction is uniform. Thus 
the use of a uniform overall sampling fraction, which is computationally con- 
venient, is justified on grounds of minimum cost. The assumption of constant 
cost per second-stage unit will not in fact hold for the component of cost due 
to travel, since the same number of second-stage units will be taken from any 
selected unit of a stratum, and consequently the travel cost per unit will be 
greater for the larger units. This, however, is not likely to reduce the efficiency 
greatly unless travel costs at the second stage are very large. 

The relation between the first-stage and second-stage sampling can also be 
very simply expressed. In the case considered in Section 8.13, in which the 
size of the first-stage units is represented by the number of second-stage units, 
if all the second-stage variances are equal we may put $\sigma_{ri}'^2 - \sigma''^2 N_i'/N_i = U_i$
and $\sigma''^2 = V$, the costs per first and second-stage unit being taken as $c_i'$ and $c''$.


We then have

$$f = \frac{V N + \sqrt{V/c''}\,\sum N_i \sqrt{U_i c_i'}}{V(Y) + \sum N_i^2 U_i/N_i'}$$

$$n_i'^2 = f^2 N_i^2 U_i c''/V c_i'$$


Since the number of second-stage units $n_i''$ per first-stage unit in stratum $i$
is the same for all selected units, and independent of which particular units
are selected, the above equations give $n_i''^2 = V c_i'/U_i c''$, as before.


(f) Two-stage sampling of farms and fields


Any fully general treatment is difficult owing to the fact that the numbers 
of fields per farm carrying a given crop are usually small, and consequently 
what would otherwise be the optimal values of the second-stage sampling 
fractions will give numbers of fields per farm which are not only non-integral, 
but which will in many cases be less than unity. 

If the numbers of fields per farm are sufficiently large for this source of 
disturbance to be neglected, the optimal values of the sampling fractions can 
be simply expressed. 

In order to standardize the notation we may replace the between-farms
component of variance $U_2$, as defined in Section 8.12, by $U'$, and the
between-fields within-farms component $U_1'$ by $U''$. The cost of visiting a farm may be
taken as $c'$, and that of sampling a field as $c''$.

We will consider the case in which the farms are divided into size-groups 
with fields which have mean areas $\bar a_1, \bar a_2, \ldots$. We will further assume that
the mean acreages per field within a size-group of farms of 1, 2, 3, . . . fields
are the same, and that $V(a_i)/\bar a_i^2$ is constant for all size-groups and for all
numbers of fields within a size-group. 

In the first place we find that in this case the second-stage sampling fractions
within a size-group should all have the same value, which is given by

$$f_i''^2 = \frac{U''\,c'}{U'\,c''}\,\frac{N_i'}{[N_i']_2}$$ (8.17.h)

where $N_{i1}'$, $N_{i2}'$, $N_{i3}'$, . . . are the numbers of farms in the group with 1, 2,
3, . . . fields respectively, and $[N_i']_2 = N_{i1}' + 4N_{i2}' + 9N_{i3}' + \ldots$ The
ratios of the first-stage sampling fractions are given by

$$\frac{f_1'^2}{\bar a_1^2\,[N_1']_2/N_1'} = \frac{f_2'^2}{\bar a_2^2\,[N_2']_2/N_2'} = \ldots = k^2, \text{ say}$$ (8.17.i)


These equations will serve to give first approximations to the relative 
sampling fractions. The relative efficiency of different variants which are 
practically applicable can then be tested by use of the expression for the 
variance of the weighted mean given in Section 8.12. It will usually be sufficient 
to use the mean acreage for each size-group in evaluating the weights, but in 
evaluating the actual size of sample required the factors $\bar a_i^2$ should be replaced
by $\bar a_i^2 + V(a_i)$.


Example 8.17.a 


Determine, from the data of Examples 6.12.b and 7.12.b, the optimal 
proportion of sample plots to eye estimates on conifer stands in a two-phase 
sampling scheme in which eye estimates only are made at the first phase, when 
the cost of visiting a stand and making an eye estimate is 1/10 the additional
cost of measuring a sample plot.

Consideration of this problem would be relevant if it were possible to
demarcate and classify the stands into conifers of over 20 years of age, etc., 
from aerial photographs. In this case, if the sampling of stands is with 
probability proportional to size, the variances of the volumes per acre, given 
in Example 7.12.b, will be required. We then have, with the regression 
method of estimation,

$$\kappa^2 = 0.745 \qquad \kappa^2/(1 - \kappa^2) = 2.92 \qquad c_1/c_2 = 1/10$$

$$n_2/n_1 = \sqrt{2.92/10} = 0.540$$

Thus sample plots should be taken on about one-half the stands which are 
visited. 
There is, however, in this case no appreciable gain by the use of two-phase 
sampling. If $n_0$ is the number of sample plots required if no eye estimates
are made we have, for equal variance,

$$\frac{1}{n_0}\,\sigma_y^2 = \frac{1}{n_1}\,\sigma_y^2 + \frac{1}{n_2}\left(1 - \frac{n_2}{n_1}\right)\kappa^2 \sigma_y^2$$

$$n_2/n_0 = n_2/n_1 + (1 - n_2/n_1)\,\kappa^2 = 0.540 + 0.460 \times 0.745 = 0.883$$

which gives

$$n_1/n_0 = 1.64$$

Thus with two-phase sampling

$$C/n_0 c_1 = 1.64 + 10 \times 0.883 = 10.47$$

and with single-phase sampling $C/n_0 c_1$ lies between 10 and 11, depending on
the saving due to the omission of the eye estimates on the stands that are
visited.
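
The figures of this example can be reproduced by the following sketch, which takes only the two quantities quoted above (the value 0.745 and the cost ratio 1/10) as inputs. Small differences in the last figure arise because the text rounds the intermediate ratios before combining them.

```python
import math

KAPPA2 = 0.745        # ratio of residual to total variance, as quoted above
C1_OVER_C2 = 1 / 10   # cost of an eye estimate relative to a sample plot

# Optimal proportion of first-phase units carried into the second phase (n2/n1).
n2_over_n1 = math.sqrt(C1_OVER_C2 * KAPPA2 / (1 - KAPPA2))

# Sizes relative to n0, the number of plots needed with no eye estimates.
n2_over_n0 = n2_over_n1 + (1 - n2_over_n1) * KAPPA2
n1_over_n0 = n2_over_n0 / n2_over_n1

# Total cost in units of n0 * c1 (each sample plot costs 10 eye estimates).
two_phase_cost = n1_over_n0 + (1 / C1_OVER_C2) * n2_over_n0

if __name__ == "__main__":
    print(round(n2_over_n1, 3), round(n2_over_n0, 3),
          round(n1_over_n0, 2), round(two_phase_cost, 2))
```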

One of the reasons why the use of eye estimates is here of little value is 
that the determination of the volumes of individual stands by means of a single 
sample plot per stand is very inaccurate. If more sample plots per stand were 
taken the overall variance of y would be reduced, while the covariance of y 
and x and the variance of x would remain unaltered. Under these circumstances
two-phase sampling would be more advantageous. Given information on the
within-stand variance of the sample plots and the cost of taking different
numbers of sample plots from a stand, the optimal number of sample plots
per stand could be determined.


Example 8.17.b


If in the crop survey of Examples 6.16.b and 7.15 the additional cost of 
crop-cutting in order to obtain an estimate of yield is 20 times the cost of 
visiting a sampling point to ascertain the nature of the crop, calculate the optimal
ratio of the number of yield determinations to total sample points, and the
number of points required to give an estimate of the total yield with a standard
error of 5 per cent. 


From the results already given we have

$$p = 2{,}202/33{,}255 = 0.0662 \qquad q = 0.9338$$

$$\bar r = 15.7 \qquad V(r) = 3.5^2 \qquad V(r)/\bar r^2 = 0.0497$$

Hence, from equation 8.17.f,

$$\frac{n}{n_0'} = \sqrt{\frac{0.0662 \times 0.0497}{0.9338 \times 20}} = 0.0133$$

Substituting in equation 8.17.e,

$$(0.05)^2\, n_0' = 0.9338/0.0662 + 0.0497/0.0133$$

$$n_0' = 7140 \qquad n = 95 \qquad n' = 473$$


The large number of points that have to be visited to ascertain the crop is 
accounted for by the small fraction of the total land area under crop. If the 
crop is an important one it will occupy a considerably larger fraction of the 
cultivated area, and if, therefore, the non-cultivated areas can be excluded, 
the total number of points required will be considerably reduced. Alternatively 
sparsely cultivated areas may be sampled with a lower intensity. 

It is also worth noting that if several crops have to be surveyed it will 
probably be possible to make the acreage determinations of all crops 
simultaneously prior to the crop-cutting work. This will alter the above cost 
relationships. The general case can be dealt with by minimization of the 
combined cost function. In the simple case in which there are a number of 
crops each occupying the same area and having the same variance and cost 
relationships, and in which the same accuracy is required for each crop, the 
above solution holds, the cost of the acreage determinations being spread 
equally over all the crops, and adjustment of c being made for the cost of 
revisits. Thus in the above example, with 5 crops and a cost of revisit per 
point of double the original cost (owing to wider dispersion), all that is necessary 
is to put $c/c' = 110$. We then find $n_0 = 9160$, $n = 52$, $n' = 606$. 
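
The arithmetic of this example can be checked with a short script. The sketch below assumes, as the worked figures indicate, that the optimal ratio of yield determinations to total points is the square root of p V(r)/r² divided by q times the cost ratio, and that the relative variance of the total is q/(p n₀) plus V(r)/(r² n). The function and variable names are illustrative and do not come from the text.

```python
import math

def two_phase_plan(p, rel_var_yield, cost_ratio, target_rel_se):
    """Two-phase design: crop identification at all points, yield on a subsample.

    p              proportion of sample points falling in the crop
    rel_var_yield  V(r)/r^2, relative variance of yield per determination
    cost_ratio     cost of one yield determination / cost of visiting one point
    target_rel_se  required relative standard error of the total yield
    """
    q = 1.0 - p
    # Ratio of yield determinations to total points (form used in the example).
    ratio = math.sqrt(p * rel_var_yield / (q * cost_ratio))
    # Relative variance of the total: q/(p*n0) + rel_var_yield/n, with n = ratio*n0.
    n0 = (q / p + rel_var_yield / ratio) / target_rel_se ** 2
    return {"ratio": ratio, "n0": n0, "n": ratio * n0, "points_in_crop": p * n0}

single_crop = two_phase_plan(0.0662, 0.0497, 20, 0.05)
five_crops = two_phase_plan(0.0662, 0.0497, 110, 0.05)
for plan in (single_crop, five_crops):
    print({k: round(v, 4 if k == "ratio" else 0) for k, v in plan.items()})
```

Run with a cost ratio of 20 and then 110, the unrounded results agree with the figures quoted in the example to within the rounding used there.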


Example 8.17.c 


If county lists of farms are not available, and if the cost of the construction 
of a list of the farms of a parish is 10 times the cost of visiting a single farm 
within the parish and ascertaining the wheat acreage, determine the optimal 
sampling fractions at the first and second stage which will give estimates of 
the acreage of wheat having a standard error of ± 4500 (i.e. approximately 
10 per cent.), using the methods of sampling followed in samples B, and B, 
of Hertfordshire farms. 

From the equation for V (Y) in terms of n' and n'' given in Example 8.11.a 
and equation 8.17.g, we have 

$$n'' = \sqrt{2524.2 \times 10/80.71} = 17.7$$

Hence, for the required accuracy, 

$$4.5^2 = \frac{80.71}{n'} + \frac{2524.2}{17.7\,n'} - 1.8982$$

which gives n' = 10.1. 


Similarly, from Example 8.11.b, 
nae =T7-0 n= 11:2 n = 18-4 


For the reasons already given these latter values are approximate. When 
proper allowance is made for the changes in sampling fractions at the second 
stage a somewhat smaller sample will be found to be necessary. 


Example 8.17.d 

From the data of Table 6.19.a determine suitable sampling fractions for 
a crop-estimation scheme for sugar beet, on the assumption that variances 
between fields on the same farm, and between farms, are the same as those 
found for wheat in Section 8.12, and that the cost of visiting a farm is (a) equal 
to, and (b) twice that of sampling a field. 


From Table 6.19.a, including farms not growing sugar beet on old arable 
land, we obtain the following values : 


                        n_i'    [n_i']_2     a_i 
Small farms              30        51        3.1 
Medium farms             47       224        8.4 
Large farms              14       121       12.3 


Taking the estimates of N_i'/[N_i']_2 given by n_i'/[n_i']_2, formulae 8.17.h and 
8.17.i give the following values for f_i'' and f_i': 


                          f_i''              f_i'/k 
                   c1 = c2   c1 = 2c2        All c 
Small farms          0.86      1.21           4.0 
Medium farms         0.51      0.72          18.3 
Large farms          0.38      0.54          36.2 

Thus when c1 = c2 we may consider variants on the following scheme. 
For small farms sample all fields; for medium farms sample one field from 
farms growing 1 or 2 fields, two fields from farms growing 3 or 4 fields, etc.; 
for large farms sample one field from farms growing 1, 2, 3 or 4 fields, two fields 
from farms growing 5, 6 or 7 fields, etc.; sample farms in the proportions 
4.0 : 18.3 : 36.2 given by the values of f_i'/k. 

A similar scheme can be drawn up for the case when c1 = 2c2. In this 
case, since f_1'' is greater than unity, some increase in the proportion of small 
farms included may be advisable. 

Further investigation is left to the reader. 


8.18 Losses due to errors 

We have so far considered the minimization of the cost of a survey when 
a given accuracy is required in the results. Since, however, errors in the 
results themselves give rise to losses when these results are used as a basis 
for further action, the accuracy should itself be determined in such a manner 
that the sum of the cost of the survey and the expected losses due to the 
resultant errors is minimized. 

If the loss due to an error Z in an estimate Y is equal to $aZ^2$, where a is 
a constant, the average loss in a series of samples of the same size and type 
in which the estimates are free from bias will be a V (Y), whatever the actual 
form of the distribution function of the errors. 

When the loss due to an error is proportional to the square of the error, 
therefore, minimization of the sum of the cost C of a survey and the average 
loss due to errors requires minimization of the function 


C+aV(Y) 
the sampling method and size of sample being so chosen that V (Y) has its 
minimum possible value for the cost C. 


Under these circumstances V (Y) will always be expressible as a function 
of C. In many cases, as we have seen, this function is of the form 

$$V(Y) = \frac{h}{C - C_0} - k$$

where h and k are constants depending on the population which is being 
surveyed, and $C_0$ is the overhead or constant component of cost which is 
independent of the size of the survey. 

In this case the minimum value will be attained when 

$$(C - C_0)^2 = ah$$

We then have 

$$V(Y) = \sqrt{h/a} - k$$

This implies that the more accurate the results that can be obtained with a 
given cost, i.e. the smaller the value of h, the higher should be the accuracy 
aimed at with a given loss function. The value of the cost-plus-loss function 
at the minimum is in fact 2 C — Co— ak. Any saving due to increased 
accuracy should therefore be divided equally between reduction in the cost 
of the survey and reduction in the loss due to errors. Equally if the loss due 
to a given error is multiplied by a factor λ, the funds devoted to the survey 
(excluding overhead) should be multiplied by √λ. 
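
The closed-form results of this section can be checked numerically. The following is a minimal sketch, assuming the variance-cost relation V(Y) = h/(C − C₀) − k and a quadratic loss aZ²; the numerical constants are hypothetical and serve only to exercise the formulae.

```python
import math

def optimal_survey(a, h, k=0.0, overhead=0.0):
    """Minimise C + a*V(Y) when V(Y) = h/(C - C0) - k.

    Returns the optimal total cost C, the variance attained there, and the
    minimum value of the cost-plus-loss function.
    """
    spend = math.sqrt(a * h)              # C - C0 at the minimum
    cost = overhead + spend
    variance = math.sqrt(h / a) - k       # equals h/spend - k
    total = cost + a * variance           # equals 2C - C0 - a*k
    return cost, variance, total

def cost_plus_loss(cost, a, h, k=0.0, overhead=0.0):
    """Cost-plus-loss function evaluated at an arbitrary total cost."""
    return cost + a * (h / (cost - overhead) - k)

# Hypothetical constants, chosen only for illustration.
a, h, k, c0 = 4.0, 900.0, 0.5, 10.0
cost, variance, total = optimal_survey(a, h, k, c0)
print(cost, variance, total)

# Quadrupling the loss constant should double the funds excluding overhead.
cost4, _, _ = optimal_survey(4 * a, h, k, c0)
print((cost4 - c0) / (cost - c0))
```

Evaluating `cost_plus_loss` a little to either side of the returned cost gives a larger value, which is a quick way of confirming that the stationary point is a minimum.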

The same general conclusions hold when the variance-cost relation is of 
a more complicated form than that given above. A case of this type is 
illustrated in Fig. 8.15, in which a loss curve of the form a V (Y) has been 
inserted. We see that with the more accurate methods of sampling the minima 
of the cost-plus-loss functions (shown by broken lines) are attained when both 
the cost of the sampling and the loss due to errors are less than with the less 
accurate methods. 

Other loss functions will lead to more complicated expressions for the 
average loss. The most general loss function which is capable of relatively 
simple expression in terms of V (Y) is that in which the loss due to a positive 
error is equal to $aZ^b$, and that due to a negative error is $a'(-Z)^b$, a, a' and b 
being constants. Provided the distribution of errors has the same form for all 
values of V (Y), the average loss is then equal to $a''\sigma^b$, where $\sigma^2 = V(Y)$ 
and a'' has a value which is a linear function of a and a'. The actual linear 
function can only be determined if the distribution function of the errors is 
known. In general terms, if the distribution function of the errors of Y is 
$f_1(z)\,dz$, where $z = Z/\sigma$, we have 

$$a'' = a\int_0^{\infty} z^b f_1(z)\,dz + a'\int_{-\infty}^{0} (-z)^b f_1(z)\,dz$$


If the distribution of the errors is normal, $f_1(z)\,dz$ will be of the form given 
in Section 7.3, with s = 1. The two integrals will in this case (as in any 
symmetrical distribution) be equal. Their values for any value of b can be 
obtained from existing tables*, those for b = 1.0, 1.25, 1.5, 1.75, 2.0 being 
0.3989, 0.4097, 0.4300, 0.4599, 0.5 respectively. It must be emphasized, 
however, that the distribution of sampling errors is frequently not sufficiently 
normal for the use of these values to be justified. In such cases, also, the 
form of the distribution may be expected to change with change in the size of 
the sample. 
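
For normally distributed errors the tabulated values can be reproduced from the Gamma function. The sketch below uses the standard closed form for the half-range moment of the unit normal distribution; the function names are illustrative only.

```python
import math

def half_moment_normal(b):
    """Integral of z**b * phi(z) from 0 to infinity, phi being the unit normal density.

    Closed form: 2**(b/2 - 1) * Gamma((b + 1)/2) / sqrt(pi).
    """
    return 2.0 ** (b / 2.0 - 1.0) * math.gamma((b + 1.0) / 2.0) / math.sqrt(math.pi)

def loss_constant(a_pos, a_neg, b):
    """The constant a'' for normal errors, where the two integrals are equal."""
    return (a_pos + a_neg) * half_moment_normal(b)

values = [half_moment_normal(b) for b in (1.0, 1.25, 1.5, 1.75, 2.0)]
print([round(v, 4) for v in values])
```

With equal constants for positive and negative errors and b = 2, `loss_constant` returns the common constant itself, so the quadratic case of the earlier paragraphs is recovered.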

With this more general loss function we require to minimize the function 

$$C + a''\left(\frac{h}{C - C_0} - k\right)^{b/2}$$

which will be minimum when 

$$(C - C_0)^2\left(\frac{h}{C - C_0} - k\right)^{1 - b/2} = \tfrac{1}{2}\,a''\,b\,h$$


This equation can easily be solved by trial and error, or directly if k can be 
neglected, as will be the case when all sampling fractions are small. The 
same general conclusion that the accuracy should be increased with a more 
accurate method of survey still holds. 
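
The trial-and-error solution can be mechanised. The following is a sketch only: it uses a golden-section search (a substitute for hand trial and error, not a method given in the text) over a bracket supplied by the user, and compares the result with the direct solution when k is neglected. All numerical constants are hypothetical.

```python
import math

def minimise_cost_plus_loss(a2, b, h, k, overhead, lo, hi, tol=1e-9):
    """Golden-section search for the spend x = C - C0 that minimises
    overhead + x + a2 * (h/x - k)**(b/2) on the bracket [lo, hi].

    The bracket must keep h/x - k positive and contain a single minimum.
    """
    def total(x):
        return overhead + x + a2 * (h / x - k) ** (b / 2.0)

    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0
    x1 = hi - inv_phi * (hi - lo)
    x2 = lo + inv_phi * (hi - lo)
    while hi - lo > tol:
        if total(x1) < total(x2):
            hi, x2 = x2, x1                 # minimum lies in [lo, old x2]
            x1 = hi - inv_phi * (hi - lo)
        else:
            lo, x1 = x1, x2                 # minimum lies in [old x1, hi]
            x2 = lo + inv_phi * (hi - lo)
    x = 0.5 * (lo + hi)
    return x, total(x)

def spend_when_k_negligible(a2, b, h):
    """Direct solution of the minimum condition with k = 0."""
    return (0.5 * a2 * b * h ** (b / 2.0)) ** (2.0 / (b + 2.0))

# Hypothetical constants for illustration.
x_k0, _ = minimise_cost_plus_loss(3.0, 1.5, 900.0, 0.0, 10.0, 1.0, 500.0)
x_k, _ = minimise_cost_plus_loss(3.0, 1.5, 900.0, 0.5, 10.0, 1.0, 500.0)
x_sq, t_sq = minimise_cost_plus_loss(4.0, 2.0, 900.0, 0.5, 10.0, 1.0, 500.0)
print(x_k0, spend_when_k_negligible(3.0, 1.5, 900.0), x_k, x_sq, t_sq)
```

The bracket matters: for b less than 2 the loss falls steeply as the variance approaches zero, so the upper limit should be kept well below h/k.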

When V (Y) is a more complicated function of C (possibly only determined 
in numerical form) the minimum of the cost-plus-loss function can itself be 
determined by trial and error. 


*Tables of the Gamma function. 

8.19 Concluding remarks 


The preceding sections give an indication of the ways in which the efficiencies 
of different sampling methods can be compared, and the techniques of 
determining the optimal sampling fractions, size of sample required for a given 
accuracy, etc. It has been further shown that the accuracy which should 
be aimed at is itself related to the losses resulting from errors in the survey 
results. 

Since determination of the optimal accuracy from the expected losses due 
to errors demands knowledge of the loss-function, it will chiefly be of relevance 
when action in the economic sphere has to be based on the results of the 
survey. An error in the estimate of the yield of a crop, for instance, may 
require changes in an import programme, or may lead to wastage, and the 
resultant additional costs may be assessable, at least roughly. The losses 
due to errors in estimates provided by surveys of the research and investigational 
type can scarcely be assessed. Indeed, it is usually impossible to give any 
quantitative estimate in monetary terms of the value of the information provided 
by such surveys. The decision to undertake the survey, and the accuracy 
aimed at, must then be a matter of judgment on the part of those who require 
the information, and those who are concerned with the allocation of resources. 

Even if the optimal accuracy cannot be quantitatively determined, arbitrary 
decisions on the accuracy required should as far as possible be avoided. Before 
any decision as to accuracy is taken, estimates should be prepared of the costs 
of obtaining results of differing degrees of accuracy, and these estimates should 
be considered in relation to the purposes for which the results are required. 

Minimization of costs can of course be carried out whether or not a loss- 
function is available. In this chapter we have only considered this minimization 
when a single quantity requires estimation. In most censuses and surveys 
such treatment would be an over-simplification. A number of quantities will 
require to be estimated, frequently for many domains of study. It may then 
be necessary to carry out a more elaborate investigation, minimizing the cost 
for defined accuracies of all the estimated quantities. Alternatively, if loss- 
functions are available for all of these quantities, the combined cost-plus-loss 
function can be minimized. Frequently, however, one of the quantities is of 
dominant importance, and the situation is such that when adequate accuracy 
is attained on this quantity the remaining quantities are determined with more 
than the required accuracy. In this case minimization can be conducted solely 
with reference to this quantity. 

Many of the examples worked out in this chapter are based on very small 
amounts of data, and the conclusions reached on the relative efficiencies of 
different methods, even in the particular circumstances of the chosen examples, 
must therefore be treated with reserve. These examples are, in fact, merely 
intended to illustrate the computational procedures, and bring out the various 
points that have to be taken into account when making calculations of relative 
efficiencies, optimal sampling fractions and size of sample. They are in no 
way intended as general investigations into the relative efficiency of the different 
methods. 

On the other hand, it should be borne in mind that no very exact 
determinations of the optimal sampling fractions and size of sample are required 
in the practical planning of surveys. If the values adopted are somewhere near 
the optimal the total cost, or the total of costs-plus-losses, will be very near 
the minimum. 

We must also not be deterred from undertaking a survey by the fact that 
there is little information on which to base exact planning of the sampling 
methods. As we have seen, surveys themselves provide information which 
will enable future surveys on similar material to be more efficiently planned. 
In surveys on relatively unknown material one of the points to be kept in mind 
in the planning is that information will be required both on variances and on 
costs. Equally, if preliminary rough estimates are required, pilot investigations 
can be designed so as to provide such estimates, as well as information on 
which to base the planning of a larger survey. 

The study of the relative efficiency of different sampling methods depends 
not so much on having a large amount of data as on having data which are 
relevant to the methods concerned. Thus the small pilot investigation on the 
estimation of wheat yields by sampling methods described in Section 8.12 
was of sufficient size to give estimates of both the field-to-field and farm-to-farm 
components of variance with all necessary accuracy. On the other hand, in 
the Survey of Fertilizer Practice—although a very large amount of data has 
now been accumulated—it is impossible to determine the field-to-field 
components of variation of the fertilizer dressings, since only one old- and one 
new-arable field of each crop was taken on each farm. This must be regarded 
as a defect in the planning of this survey, which could have been remedied 
had a pair of fields been taken for the various crops on a small proportion of 
the farms. Incidentally, lack of this information has also prevented any 
consideration of the question of the extent to which individual farmers vary 
their fertilizer practice from field to field of the same crop. 

Given the necessary data, the increase in the efficiency of survey methods 
requires proper statistical investigations of the types outlined in this chapter. 
The need for thorough investigation of the efficiencies of different sampling 
methods in different circumstances is great, and it is to be hoped that many 
more will be made and reported in the future. Such investigations are often 
neglected because, once a survey has been completed, the question of whether 
it could have been carried out more efficiently is largely historical as far as 
that survey is concerned. One of the reasons why both the theory and practice 
of sample censuses and surveys have made rapid advances in recent years 
is that permanent organizations—often part of, or attached to, research statistical 
institutes—have been set up in a number of countries. These organizations 
have been actively engaged both in the planning and the execution of surveys 
covering various fields of enquiry. Consequently they have not only had 
access to the necessary data, or the means of collecting it, but they have also 
had a continuing interest in investigations of efficiency, and a body of workers 
who have both the training and experience to carry out these investigations. 

Further progress may be expected on the same lines. In particular the 
problems that arise in censuses and surveys of undeveloped areas will be likely 
to receive very much more thorough investigation when more centres which 
are actively concerned with the planning and execution of surveys in these 
areas are developed. Only in this way will a body of experience be built up 
which is relevant to the special problems of such surveys. 

FURTHER NOTES ON THE CRITICAL ANALYSIS OF 
SURVEY DATA 


9.1 Introduction 


As has been pointed out at various places in the preceding chapters, surveys 
fall into two main classes: those which have as their object the assessment of 
the characteristics of the population or different parts of it, and those which 
are investigational in character. In the census type of survey, estimates of 
the characteristics, quantitative and qualitative, of the whole population and 
possibly of various previously defined subdivisions of it are required. These 
estimates form the basis of administrative action, either directly or after 
incorporation with information from other sources. The accuracy to be aimed 
at is determined by the nature of the administrative action that is envisaged. 
In the investigational type of survey we are more concerned with the study of 
relationships between different variates, and with contrasts between different 
domains. In such surveys estimates appertaining to the whole population are 
usually of relatively minor interest. 

The critical analysis of the results of an investigational survey is a much 
more difficult task than is the calculation of estimates and their errors in a 
survey of the census type. The matter has already been briefly discussed in 
Sections 5.23 and 5.24, and in various of the illustrative examples. In the 
present chapter the matter has been taken somewhat further by the inclusion 
of some additional examples, and by a discussion of the uses of ratios and 
regressions in investigational work. The discussion of sampling errors of 
contrasts between domains—an important, if somewhat tiresome, subject— 
has also been amplified and extended. It must be emphasised, however, that 
the chapter is not intended to be an exhaustive treatise on the analysis of 
investigational surveys: this would require much more space than is available 
here. 


9.2 Contrasts between domains : random sample 


The formulae for estimates of the domain values are exactly analogous to 
those for the population values given in Section 6.4. If the suffix a denotes 
domain A, so that $n_a$, for example, is the number of units in the sample falling 
in that domain, and $\bar{y}_a$ and $S_a(y)$ are the mean and total of all the y values of 
the domain A units in the sample, we have 

$$p_a = n_a/n \tag{9.2.a}$$
$$N_a = g\,n_a = N p_a \tag{9.2.b}$$
$$\bar{y}_a = \frac{1}{n_a}\,S_a(y) \tag{9.2.c}$$
$$Y_a = g\,S_a(y) = N_a\,\bar{y}_a \tag{9.2.d}$$
$$r_a = \frac{S_a(y)}{S_a(x)} \tag{9.2.e}$$


The first two formulae have already been given in Section 6.4 with a slightly 
different notation ($n_a = u$, $p_a = p$, $N_a = U$). The additional symbol $p_a$ 
for the proportion of units in the domain A is introduced for convenience. 
As before a small, but usually trivial, gain in accuracy can be obtained by 
replacing the estimated population number N by its known value. 

In addition to the proportion pa of all units belonging to domain A we are 
often concerned with the proportion of the units of domain A that possess a 
given attribute, or with their total number. This proportion, which may be 
denoted by $h_a$, must be clearly distinguished from $p_a$. Following the notation 
of Section 6.4, the corresponding total number in the population may be 
denoted by $U_a$, and that in the sample by $u_a$. Then 

$$h_a = u_a/n_a \tag{9.2.f}$$
$$U_a = g\,u_a = h_a N_a \tag{9.2.g}$$


The estimation of the variances of $\bar{y}_a$, $Y_a$ and $r_a$ differs in two respects 
from the estimation of the variances of $\bar{y}$, Y and r. The first cause of difference 
is that only the component of variance of y within the domain A, and not the 
total variance of y in the population, contributes to the variance of the estimates 
$\bar{y}_a$ and $Y_a$. This component may be denoted by $s_a^2$. We have 

$$s_a^2 = \frac{S_a(y - \bar{y}_a)^2}{n_a - 1} \tag{9.2.h}$$

The variances within the different domains will not only differ from the 
total variance, being in general less than this total variance, but may also differ 
amongst themselves. If, however, the population is divided into a large number 
of different domains, and the nature of the material is such that the variances 
within the different domains may be expected to be approximately equal, a 
pooled estimate of this variance over all domains may be adopted, using the 
analysis of variance technique in the manner of Section 7.7. 

In a similar manner the variance of $r_a$ depends on the variance $s_{ra}^2$ within 
the domain A about the ratio line for that domain. We have 

$$s_{ra}^2 = \frac{S_a(y - r_a x)^2}{n_a - 1} \tag{9.2.i}$$

The second cause of difference lies in the fact that the variance of the 
total $Y_a$ is increased because $N_a$ is subject to variation. Moreover the numbers 
of selected sample units $n_a$ and $n_b$ in two domains are negatively correlated, 
the covariance of $n_a$ and $n_b$ being $-n^2 p_a p_b/(n - 1)$, and this gives rise to 
negative covariances between $p_a$ and $p_b$, $N_a$ and $N_b$, and $Y_a$ and $Y_b$. 
Furthermore, for the reason given in Section 7.2, the factor $(1 - f)$, 
which makes allowance for sampling from a finite population, should normally 
be omitted from the variance formulae when comparisons between different 
domains are being made in investigational studies. 

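Putting $q_a = 1 - p_a$, the resultant formulae for the variances and 
covariances of the estimates are as follows: 

$$V(p_a) = \frac{(1 - f)\,p_a q_a}{n - 1} \tag{9.2.j}$$
$$\mathrm{cov}(p_a, p_b) = -\frac{(1 - f)\,p_a p_b}{n - 1} \tag{9.2.k}$$
$$\mathrm{cov}(N_a, N_b) = -\frac{g^2 n^2 (1 - f)\,p_a p_b}{n - 1} \tag{9.2.m}$$
$$V(\bar{y}_a) = \frac{(1 - f)\,s_a^2}{n_a} \tag{9.2.n}$$
$$\mathrm{cov}(\bar{y}_a, \bar{y}_b) = 0 \tag{9.2.o}$$
$$V(Y_a) = \frac{g^2 n (1 - f)\{n_a q_a \bar{y}_a^2 + (n_a - 1)\,s_a^2\}}{n - 1} \tag{9.2.p}$$
$$\mathrm{cov}(Y_a, Y_b) = -\frac{g^2 n^2 (1 - f)\,p_a p_b\,\bar{y}_a \bar{y}_b}{n - 1} \tag{9.2.q}$$
$$V(r_a) = \frac{(1 - f)\,s_{ra}^2}{n_a \bar{x}_a^2} \tag{9.2.r}$$
$$\mathrm{cov}(r_a, r_b) = 0 \tag{9.2.s}$$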

The variances of $h_a$ and $U_a$, and the corresponding covariances, can be 
obtained from the variances and covariances of $\bar{y}_a$ and $Y_a$ by scoring all units 
with the attribute 1 and all those without the attribute 0. This gives 

$$\bar{y}_a = h_a, \qquad Y_a = U_a, \qquad s_a^2 = n_a h_a (1 - h_a)/(n_a - 1).$$

The quantity in the curly bracket of formula 9.2.p will be found to equal 
$n_a h_a (1 - p_a h_a)$. 

If each of the domains of study with which we are concerned forms only a 
small fraction of the whole population, the covariances will be small relative 
to the corresponding variances, but if the fractions are large the covariances 
will be of some importance and must be taken into account when calculating 
the differences between estimates for the different domains. 


Example 9.2 
The final poll of the British Institute of Public Opinion in the 1951 election, 
based on a sample of 2,300 individuals, gave a forecast of the voting (excluding 
3.5 per cent. who gave no indication of the way they would vote) as follows 
(Durant and Gregory, 1951, E’). 

Conservative     49.5 per cent. 
Labour           47.0 per cent. 
Liberal           3.0 per cent. 
Others            0.5 per cent. 


If the sample were a random one, what would be the standard error of the 
difference between Conservatives (A) and Labour (B) ? 


The effective number in the sample is 96.5 per cent. of 2,300, i.e. 2,220. 
Hence 

$$V(p_a) = \frac{0.495 \times 0.505}{2219} = 0.000113$$
$$V(p_b) = \frac{0.47 \times 0.53}{2219} = 0.000112$$
$$\mathrm{cov}(p_a, p_b) = -\frac{0.495 \times 0.47}{2219} = -0.000105$$
$$V(p_a - p_b) = 0.000113 + 0.000112 - 2(-0.000105) = 0.000435 = 0.0209^2$$

Thus the predicted Conservative percentage majority of 2.5 per cent. 
would have an estimated standard error of 2.09 per cent. If the covariance 
term had been omitted the estimate of the standard error would have been 
1.50 per cent., which is substantially below the correct value. 
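
The calculation can be reproduced in a few lines. This is a sketch under the example's own assumption of a random sample with the finite-population factor omitted; the function names are illustrative.

```python
import math

def var_p(p, n):
    """Estimated variance of a sample proportion, finite-population factor omitted."""
    return p * (1.0 - p) / (n - 1)

def cov_p(pa, pb, n):
    """Estimated covariance of two proportions from the same random sample."""
    return -pa * pb / (n - 1)

def se_difference(pa, pb, n, include_cov=True):
    """Standard error of pa - pb, with or without the covariance term."""
    v = var_p(pa, n) + var_p(pb, n)
    if include_cov:
        v -= 2.0 * cov_p(pa, pb, n)
    return math.sqrt(v)

n = 2220
se_full = se_difference(0.495, 0.47, n)
se_naive = se_difference(0.495, 0.47, n, include_cov=False)
print(round(100 * se_full, 2), round(100 * se_naive, 2))
```

Because the script carries unrounded intermediate values, its result for the full standard error may differ from the figure in the text in the last decimal place.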

It should be noted that the samples in polls of this kind are actually quota 
samples, and the random component of error will thereby be reduced by the 
stratification thus introduced. The amount of the reduction can be estimated 
if the numbers and voting intention of the different strata (quota categories) 
are known, using the formulae of the next section. The reduction will only 
be substantial, however, if the differences in voting intention of the different 
strata are very substantial (see Example 8.2.a). The non-random components 
of error in forecasts of this kind have been discussed in Section 4.22. 


9.3 Contrasts between domains: stratified sample with uniform or 
variable sampling fraction 


Three cases arise. The domains of study may consist of strata or groups 
of strata, they may consist of parts of a single stratum, or they may cut across 
strata. The first and second cases present no new problems. In the first 
case, since the part of the sample belonging to any domain constitutes a random 
or stratified random sample of that domain, the methods developed in the 
previous chapters apply. Moreover, since the sampling of the different strata is 
independent there will be no covariance between the estimates for the different 
domains. In the second case, the relevant part of the sample constitutes a 
random sample of the stratum concerned, and the formulae of the last section 
are therefore appropriate. 

In the third case, in which the domains cut across strata, the variances will 
be affected in much the same manner as in a random sample, except that in 
this case $V(\bar{y}_a)$ also loses its simple form. The formulae for $V(\bar{y}_a)$ and $V(Y_a)$ 
have already been given in Section 7.6 but are repeated here for convenience, 
using the notation of the present chapter. 


$$V(N_a) = \sum g_i^2 n_i^2 (1 - f_i)\,p_{ia} q_{ia}/(n_i - 1) \tag{9.3.a}$$
$$\mathrm{cov}(N_a, N_b) = -\sum g_i^2 n_i^2 (1 - f_i)\,p_{ia} p_{ib}/(n_i - 1) \tag{9.3.b}$$
$$N_a^2\,V(\bar{y}_a) = \sum g_i^2 n_i (1 - f_i)\{n_{ia} q_{ia} (\bar{y}_{ia} - \bar{y}_a)^2 + (n_{ia} - 1)\,s_{ia}^2\}/(n_i - 1) \tag{9.3.c}$$
$$N_a N_b\,\mathrm{cov}(\bar{y}_a, \bar{y}_b) = -\sum g_i^2 n_i^2 (1 - f_i)\,p_{ia} p_{ib} (\bar{y}_{ia} - \bar{y}_a)(\bar{y}_{ib} - \bar{y}_b)/(n_i - 1) \tag{9.3.d}$$
$$V(Y_a) = \sum g_i^2 n_i (1 - f_i)\{n_{ia} q_{ia} \bar{y}_{ia}^2 + (n_{ia} - 1)\,s_{ia}^2\}/(n_i - 1) \tag{9.3.e}$$
$$\mathrm{cov}(Y_a, Y_b) = -\sum g_i^2 n_i^2 (1 - f_i)\,p_{ia} p_{ib}\,\bar{y}_{ia} \bar{y}_{ib}/(n_i - 1) \tag{9.3.f}$$
$$X_a^2\,V(r_a) = \sum g_i^2 n_i (1 - f_i)\{n_{ia} q_{ia} (\bar{y}_{ia} - r_a \bar{x}_{ia})^2 + (n_{ia} - 1)\,s_{ria}^2\}/(n_i - 1) \tag{9.3.g}$$
$$X_a X_b\,\mathrm{cov}(r_a, r_b) = -\sum g_i^2 n_i^2 (1 - f_i)\,p_{ia} p_{ib} (\bar{y}_{ia} - r_a \bar{x}_{ia})(\bar{y}_{ib} - r_b \bar{x}_{ib})/(n_i - 1) \tag{9.3.h}$$

For purposes of computation it is often convenient to replace $n_i p_{ia}$ by 
$n_{ia}$, etc., and when the factors $(1 - f_i)$ are included to replace $g_i^2 (1 - f_i)$ by 
$g_i (g_i - 1)$. 

As before, the covariances are of relatively little importance if each domain 
covers only a small part of each stratum. In this case the $q_{ia}$ will be nearly 
unity. As mentioned in Section 7.6 and illustrated in Example 7.7.b, an 
approximate estimate of the sampling error of the domain means (and ratio 
estimates) will then be obtained by treating the sample as if it were stratified 
for the domains but not for the strata, provided the sampling fraction is uniform. 
A similar estimate for the errors of the domain totals will be obtained by 
treating the sample in the same manner, but omitting the corrections for the 
means, i.e. by replacing $S_a(y - \bar{y}_a)^2$ by $S_a(y^2)$. 

This simplification does not hold when the sampling fraction is variable. 
In this case the full formulae must be used. 


Example 9.3 
In the National Farm Survey (Section 5.21) for the county of Hereford the 
farms of the sample (excluding size-group 1) were classified into the following 
domains: 

                          Percentage of arable land 
A   Mainly grass                 0-29.9 
B   Intermediate                30-49.9 
C   Mainly arable               50-100 

Estimate the variances and covariances of the numbers of farms and the total 
and mean acreages of crops and grass in the various domains. 


The basic data are given in Table 9.3.a (there were no farms in size-group 
5). As an example we may give an outline of the computation of the variances 
and covariances of the mean acreages. The first step is to prepare tables of 
$p_{ia}$, $\bar{y}_{ia}$ and $s_{ia}^2$. These are given in Table 9.3.b. Tables of $q_{ia}$ and $(\bar{y}_{ia} - \bar{y}_a)$ 
(not reproduced here) will also be required. For each size-group the term of 
any particular variance or covariance can then be computed. Finally the relevant 
terms can be added and divided by $N_a^2$, etc., to give the variances and covariances. 


TABLE 9.3.a: NUMBERS, TOTAL ACREAGES AND SUMS OF SQUARES 

                                   Domain 
Size-group            A            B            C          Total 

Number of farms, n_ia, etc. 
2                    72           62           39            173 
3                    79          155          108            342 
4                    13           61           25             99 
Raised total      1,062        1,362          872          3,296 

Total acreage, S_ia(y), etc. 
2                 3,708        3,553        2,054          9,315 
3                11,570       27,410       18,800         57,780 
4                 4,860       22,460        9,410         36,730 
Raised total     93,080      190,090      114,560        397,730 

Sum of squares, S_ia(y²), etc. 
2               225,980      243,987      124,252        594,219 
3             1,838,900    5,409,900    3,582,000     10,830,800 
4             1,850,400    8,537,800    3,649,900     14,038,100 


TABLE 9.3.b: VALUES OF p_ia, ȳ_ia AND s_ia² 

Size-group      g_i       p_ia      p_ib      p_ic 
2                10     0.4162    0.3584    0.2254 
3                 4     0.2310    0.4532    0.3158 
4                 2     0.1313    0.6162    0.2525 

Size-group      ȳ_ia      ȳ_ib      ȳ_ic      ȳ_i 
2               51.50     57.31     52.67     53.84 
3              146.46    176.84    174.07    168.95 
4              373.85    368.20    376.40    371.01 
ȳ_a, etc.       87.65    139.57    131.38    120.67 

Size-group      s_ia²     s_ib²     s_ic²     s_i² 
2               493.2     661.9     423.0     538.7 
3              1851.4    3654.2    2891.7    3135.0 
4              2792.3    4468.4    4499.0    4192.8 


The results are shown in Table 9.3.c. It will be seen that the variances of 
$Y_a$ and $\bar{y}_a$ are both substantially greater for the separate domains than would 
be expected from the corresponding variances for the whole county. 


TABLE 9.3.c: VARIANCES AND COVARIANCES 

                          Variances                       Covariances 
                   A       B       C      All       A, B     A, C     B, C 
N_a, etc.        4559    4668    3662       0      -2783    -1776    -1885 
Y_a/100, etc.    3394    6111    4528    2208      -2028    -1257    -2627 
ȳ_a, etc.       12.72   21.14   34.49    2.03      -6.19    -5.83    -9.15 


The variances and covariances for the separate domains can be used to 
calculate the variance of the corresponding estimate for the whole county, 
which thus provides a check. For number and total acreage the agreement 
should be exact. For mean acreage the agreement is only approximate, since 
$\bar{y}_a$ is correlated with $N_a$. We find, in fact, 

$$(1062^2 \times 12.72 + \ldots - 2 \times 1062 \times 1362 \times 6.19 - \ldots)/3296^2 = 2.70,$$

compared with the correct value of 2.03. 
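
The first two rows of Table 9.3.c can be recomputed directly from the counts and sums of Table 9.3.a. The sketch below assumes formulae 9.3.a and 9.3.b for the numbers and the formula for V(Y_a) given in this section, with g²(1 − f) replaced by g(g − 1) as suggested above. It also uses the algebraic identity that the curly bracket in V(Y_a) equals S_ia(y²) − S_ia(y)²/n_i. The variable and function names are illustrative.

```python
# Data from Table 9.3.a by size-group: raising factor g_i, sample counts n_ia,
# and the sums S_ia(y) and S_ia(y^2) for domains A, B, C (in that order).
G = {2: 10, 3: 4, 4: 2}
COUNTS = {2: (72, 62, 39), 3: (79, 155, 108), 4: (13, 61, 25)}
SUM_Y = {2: (3708, 3553, 2054), 3: (11570, 27410, 18800), 4: (4860, 22460, 9410)}
SUM_Y2 = {2: (225980, 243987, 124252),
          3: (1838900, 5409900, 3582000),
          4: (1850400, 8537800, 3649900)}

def cov_numbers(a, b):
    """V(N_a) when a == b, cov(N_a, N_b) otherwise."""
    total = 0.0
    for i, g in G.items():
        n_i = sum(COUNTS[i])
        n_a, n_b = COUNTS[i][a], COUNTS[i][b]
        # n_i^2 p_ia q_ia = n_ia (n_i - n_ia);  n_i^2 p_ia p_ib = n_ia n_ib
        cross = n_a * (n_i - n_a) if a == b else -n_a * n_b
        total += g * (g - 1) * cross / (n_i - 1)
    return total

def var_total(a):
    """V(Y_a), the curly bracket reducing to S_ia(y^2) - S_ia(y)^2 / n_i."""
    total = 0.0
    for i, g in G.items():
        n_i = sum(COUNTS[i])
        bracket = SUM_Y2[i][a] - SUM_Y[i][a] ** 2 / n_i
        total += g * (g - 1) * n_i * bracket / (n_i - 1)
    return total

matrix_n = [[round(cov_numbers(a, b)) for b in range(3)] for a in range(3)]
var_y = [round(var_total(a) / 1e4) for a in range(3)]
print(matrix_n)
print(var_y)
```

Summing every entry of the matrix for the numbers gives zero apart from rounding, which is the check described in the text for the whole-county value.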


9.4 Errors in the estimation of the proportions of the population total 
attributable to different domains 


In many cases in which a quantitative variate is being studied we are 
interested in the proportions or percentages which are attributable to different 
domains, rather than the actual totals for the domains. Thus in the National 
Farm Survey, described in Section 3.7, interest attached to the proportion 
of the total farm land that was tenant-occupied, the proportion that was farmed 
by full-time farmers, etc. 

Estimates of such proportions are given very simply for all types of sampling 
by dividing the estimate $Y_a$ of the total for the domain A by the estimate Y 
of the total for the whole population. Thus, denoting the estimate of the 
proportion by $P_a$, we have 

$$P_a = Y_a/Y$$



If the population total is known from other sources we may use Pa to provide 
an alternative estimate Ya’ of the domain total which will in general be more 
accurate than Ya. The formula is 

Ya’ = PaY 
Estimation, therefore, presents no new problems, but the estimates of error 
will be affected by the covariance between Ya and Y. 


Case (a) Domains not cutting across strata 

When a domain A comprises one or more complete strata there will be no 
covariance between Ya and Ya-, where Ya- is the estimate of the total of the 
remainder of the population. We also have Ya + Ya- = Y. Consequently 
V (Y) = V (Ya) +V (Ya-), and cov (Ya, Y) = V (Ya). The ordinary formula 
for the variance of a ratio then gives 

$$Y^2\,V(P_a) = (1 - 2P_a)\,V(Y_a) + P_a^2\,V(Y) = Q_a^2\,V(Y_a) + P_a^2\,V(Y_{a-})$$

where Qa = 1 — Pa. If there are more than two domains the first form is 
most suitable for computation. 

The covariance between the proportions for two mutually exclusive domains 
A and B can be similarly deduced from the formula for the covariance of two 
ratios, which in the notation of Section 7.5 is 

$$\mathrm{cov}\left(\frac{y_1}{y_2}, \frac{y_3}{y_4}\right) = \frac{y_1 y_3}{y_2 y_4}\left\{\frac{\mathrm{cov}(y_1, y_3)}{y_1 y_3} - \frac{\mathrm{cov}(y_1, y_4)}{y_1 y_4} - \frac{\mathrm{cov}(y_2, y_3)}{y_2 y_3} + \frac{\mathrm{cov}(y_2, y_4)}{y_2 y_4}\right\}$$

We thus find 

$$Y^2\,\mathrm{cov}(P_a, P_b) = -\{P_b\,V(Y_a) + P_a\,V(Y_b) - P_a P_b\,V(Y)\}$$

It is easily verified that when A and B together make up the whole population 
$-\mathrm{cov}(P_a, P_b) = V(P_a) = V(P_b)$ as it should. More generally, if the population 
is divided into a number of domains the sum of all the variances plus 
twice the sum of all the covariances should equal zero. This provides a useful 
check when all variances and covariances are required. 


Case (b) Random sample and stratified sample with domains of study cutting 
across strata 
In this case there will be covariance between $Y_a$, $Y_b$, $Y_c$, . . . It is best 
to calculate first the variances and covariances of $Y_a$, $Y_b$, $Y_c$, . . . from the 
formulae given in Sections 9.2 and 9.3. The variances and covariances of the 
proportions may then be obtained from the formulae 

$$Y^2\,V(P_a) = V(Y_a) - 2P_a\,\mathrm{cov}(Y_a, Y) + P_a^2\,V(Y)$$
$$Y^2\,\mathrm{cov}(P_a, P_b) = \mathrm{cov}(Y_a, Y_b) - P_b\,\mathrm{cov}(Y_a, Y) - P_a\,\mathrm{cov}(Y_b, Y) + P_a P_b\,V(Y)$$

with 

$$\mathrm{cov}(Y_a, Y) = V(Y_a) + \mathrm{cov}(Y_a, Y_b) + \mathrm{cov}(Y_a, Y_c) + \ldots \text{ etc.}$$

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Example 9.4 
Estimate the variances and covariances of the percentage of land in Hereford 
attributable to the three types of farm of Example 9.3. 


The percentages are 


100 Pa = 23-40 
100 Pp = 47-79 
100 Pe = 28-80 
From Table 9.3.c the variance and covariance matrix of Ya, Yo, Ye is 
104x 3394 — 2028 — 1257 
— 2028 6111 — 2627 
— 1257 — 2627 4528 
109 1456 644 2209 


The column totals give cov (Ya, Y) etc., and the grand total gives V(Y). 


We thus have

$$100^2\,V(P_a) = \frac{(3394 - 2 \times 0.2340 \times 109 + 0.2340^2 \times 2209) \times 10^8}{397{,}730^2} = 2.190$$

$$100^2\,\mathrm{cov}(P_a, P_b) = \frac{(-2028 - 0.4779 \times 109 - 0.2340 \times 1456 + 0.2340 \times 0.4779 \times 2209) \times 10^8}{397{,}730^2} = -1.374$$
The full variance and covariance matrix is

          2.190    -1.374    -0.816
         -1.374     3.302    -1.928
         -0.816    -1.928     2.744

with the check that each column total is zero.
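
The arithmetic of this example can be retraced with a few lines of code. The following is a minimal sketch, assuming the matrix of Table 9.3.c, the proportions and the total acreage exactly as quoted above; the function name `proportion_covariances` is introduced here for illustration and is not part of the text.

```python
# Variance-covariance matrix of percentages from that of the estimated totals.
# Figures are those quoted in Example 9.4; names are illustrative only.

def proportion_covariances(cov_totals, proportions, total):
    """Return 100^2 * cov(P_j, P_l) for every pair of domains."""
    k = len(proportions)
    # cov(Y_j, Y) is the j-th row (equivalently column) total of the matrix.
    cov_with_total = [sum(row) for row in cov_totals]
    # V(Y) is the grand total of the matrix.
    var_total = sum(cov_with_total)
    result = [[0.0] * k for _ in range(k)]
    for j in range(k):
        for l in range(k):
            value = (cov_totals[j][l]
                     - proportions[l] * cov_with_total[j]
                     - proportions[j] * cov_with_total[l]
                     + proportions[j] * proportions[l] * var_total)
            result[j][l] = 100.0 ** 2 * value / total ** 2
    return result

# Table 9.3.c values as quoted in the example (the matrix is in units of 10^4).
COV_TOTALS = [[3394e4, -2028e4, -1257e4],
              [-2028e4, 6111e4, -2627e4],
              [-1257e4, -2627e4, 4528e4]]
PROPORTIONS = [0.2340, 0.4779, 0.2880]
TOTAL = 397_730  # estimated total Y used in the example

matrix = proportion_covariances(COV_TOTALS, PROPORTIONS, TOTAL)
for row in matrix:
    print("  ".join(f"{value:7.3f}" for value in row))
```

Because the quoted proportions are rounded (they sum to 0.9999), the column totals of the computed matrix are zero only to within rounding.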


9.5 Relative precision of different methods of sampling when domains of study cut across strata

The relative precision of various sampling methods when domains of study 
cut across strata can be studied by the methods already outlined in Chapter 8. 
All that is necessary in any particular case is to estimate and compare the 
expected sampling errors when different methods are used. 

From the results already given it will be apparent that with a uniform 
sampling fraction the gain in accuracy which results from stratification is largely 
lost for domains of study that cut across strata. It is therefore important
where practicable to use strata which correspond to the expected domains of
study. This, however, is not always possible, either because of the resulting
increase in complexity or because the information necessary for classification 
of the sampling units into appropriate domains of study is only obtained in the 
course of the survey. In the National Farm Survey described in Section 5.21, 
for example, it would have been impossible to stratify for all the various domains 
of study into which the data were subsequently broken down. Information 
on such items as type of occupancy was not known in advance (indeed, collection 
of this information was one of the objects of the survey) ; but even had it been 
available the number of different types of domain, all of which cut across one 
another, was so great that the number of sub-classes thereby created would 
have been far too numerous to be used as strata. 

When a variable sampling fraction is used the situation is somewhat different.
Although there is likely to be a large increase in variance when domains of
study cut across strata, there will still be substantial gains from the use of a
variable sampling fraction in place of a uniform sampling fraction. The
optimal values of the sampling fractions will, however, differ from those which
are optimal for the population estimates and will indeed depend on what
quantities (numbers, totals, means or proportions) require estimation. The
sampling will be optimal for population estimates of means or totals when the
sampling fractions are such that the values of $f_i\sqrt{c_i}$ are proportional to $\sigma_i$
(Section 8.17 (a)). For the estimation of a mean of a particular domain the
sampling will be approximately optimal when $f_i\sqrt{c_i}$ is proportional (apart
from errors of estimation) to the square root of $1/(n_i - 1)$ times the quantity
in curly brackets in the formula for $V(\bar{y}_a)$ of Section 9.3. Replacing $n_{ia} - 1$
by $n_{ia}$ and $n_i - 1$ by $n_i$ gives $f_i\sqrt{c_i}$ approximately proportional to the square
root of

$$p_{ia}\{q_{ia}(\bar{y}_{ia} - \bar{y}_a)^2 + s_{ia}^2\} \qquad (9.5.a)$$

For the corresponding total $f_i\sqrt{c_i}$ must similarly be approximately proportional
to the square root of

$$p_{ia}\{q_{ia}\bar{y}_{ia}^2 + s_{ia}^2\} \qquad (9.5.b)$$

The expression for a ratio is similar.

If the strata consist of size-groups of the variate y, or of a variate x highly
correlated with y, and if we take $f_i\sqrt{c_i}$ proportional to $\bar{y}_i$ or $\bar{x}_i$ for the
different size-groups, allocation will be about optimal for the estimation of the
totals of domains cutting across strata, and for the proportions $Y_a/Y$, especially
in those cases (which are of frequent occurrence) in which $s_{ia}$ is about
proportional to the $\bar{y}_i$ or $\bar{x}_i$ of the size-group. In the estimation of the means of
different domains, however, a greater proportion will require to be taken
from the extreme size-groups, small as well as large. For estimation of the
number of units in the different domains sampling will be optimal when $f_i\sqrt{c_i}$
all have approximately the same value. The best balance between these
conflicting requirements will usually be attained by increasing somewhat the
sampling fractions for the size-groups with small y. This is in fact what was
done in the National Farm Survey.

Example 9.5

Using the data of Example 9.3, determine the relative precision of (a) the 
sample of that example, (b) a stratified random sample with uniform sampling 
fraction, and (c) a fully random sample, with regard to numbers of farms in 
the different domains, and their mean and total acreages. 

We will here outline the calculations for $V(Y_a)$. It is best, particularly if
relative efficiencies require to be evaluated, to introduce the simplification used
in arriving at expression 9.5.b. We then have

$$V(Y_a) = \sum (g_i - 1)\,N_{ia}\,(q_{ia}\bar{y}_{ia}^2 + s_{ia}^2) \qquad (9.5.c)$$

The values of $N_{ia}$ can be obtained from Table 9.3.a by applying the relevant
raising factors. The quantities $q_{ia}\bar{y}_{ia}^2 + s_{ia}^2$ can be calculated from Table
9.3.b. Estimates of the proportions $h_{ia}$, etc., of farms falling in the different
size-groups are also required for each domain. All these quantities are tabulated
in Table 9.5.a.


TABLE 9.5.a: VALUES OF N_ia, ETC.

Size-group      N_ia       h_ia          q_ia ȳ_ia² + s_ia²
    2            720     0.677966             2041.6
    3            316     0.297552            18346.9
    4             26     0.0244821          124205.1

                1062     1.000000


For a stratified sample with uniform sampling fraction containing the same
total number of farms $g = 3296/614 = 5.36808$. $V(Y_a)$ is given by formula
9.5.c with all $g_i$ equal to $g$. Thus by summing the products of the second
and fourth columns of Table 9.5.a, and multiplying by $g - 1$, we obtain

$$V(Y_a) = 4.36808 \times 10497000 = 45850000.$$

For the random sample the estimated variance $s_a^2$ within domain A over all
size-groups is required. The method of Section 8.3 (c) must be followed.
The various terms in the formula for $s_a^2$ (formula for $s^2$ of Section 8.3.c) can
be determined from the values already given in Table 9.3.b, and the values
of $h_{ia}$ above. To avoid having to calculate $\bar{y}_{ia}$ and $\bar{y}_a$ to more decimal places
than are given in Table 9.3.b, $\sum h_{ia}\bar{y}_{ia}^2 - \bar{y}_a^2$ may be replaced by

$$\sum h_{ia}(\bar{y}_{ia} - y_0)^2 - (\bar{y}_a - y_0)^2$$

where $y_0$ is a working mean, say 100, near $\bar{y}_a$. This comes to 3920.50. We
also find $\sum h_{ia} s_{ia}^2 = 953.62$ and $\sum h_{ia}(1 - h_{ia})\,s_{ia}^2/n_{ia} = 11.52$. Note that here
the original $n_{ia}$ are to be used. Hence

$$s_a^2 = 953.62 + 3920.50 - 11.52 = 4862.60.$$

The estimated value of $q_a$ is $1 - 1062/3296 = 0.677791$. Hence

$$q_a\bar{y}_a^2 + s_a^2 = 10069.74,$$

and thus from formula 9.2.p

$$V(Y_a) = N_a (g - 1)(q_a\bar{y}_a^2 + s_a^2) = 1062 \times 4.36808 \times 10069.74 = 46710000.$$

For total acreage of domain A, therefore, the precision of the stratified
random sample with uniform sampling fraction relative to that with the variable
sampling fraction (Table 9.3.c) is 3394/4585 = 74.0 per cent. The precision
of a fully random sample relative to a stratified sample with uniform sampling
fraction is 4585/4671 = 98.2 per cent. There is, therefore, considerable gain
by the use of the variable sampling fraction but very little gain by stratification.
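
The comparison for domain A can be retraced as follows. This is a sketch only, assuming the entries of Table 9.5.a, the value 10069.74 found for the random sample, and the variance 3394 × 10^4 quoted from Table 9.3.c; all variable names are introduced here for illustration.

```python
# Relative precision for the total acreage of domain A (Example 9.5).
# Variable names are illustrative and do not come from the text.

N_IA = [720, 316, 26]                    # estimated numbers of farms N_ia
TERM = [2041.6, 18346.9, 124205.1]       # q_ia * ybar_ia^2 + s_ia^2

g = 3296 / 614                           # uniform raising factor

# Stratified sample, uniform sampling fraction: formula 9.5.c with g_i = g.
v_uniform = (g - 1) * sum(n * t for n, t in zip(N_IA, TERM))

# Fully random sample: formula 9.2.p with the values quoted in the example.
n_a, term_a = 1062, 10069.74
v_random = n_a * (g - 1) * term_a

# Variable sampling fraction: V(Y_a) as quoted from Table 9.3.c.
v_variable = 3394e4

print(f"V (uniform s.f.) = {v_uniform:,.0f}")
print(f"V (random)       = {v_random:,.0f}")
print(f"u.s.f./v.s.f.    = {100 * v_variable / v_uniform:.1f} per cent")
print(f"random/u.s.f.    = {100 * v_uniform / v_random:.1f} per cent")
```

The variances differ slightly in the last figures from those printed, since the text rounds the intermediate sum to 10497000.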

The full results for number of farms, total acreage and mean acreage are
shown in Table 9.5.b. For number of farms the stratified sample with uniform
sampling fraction is on the average about 40 per cent. more precise than the
sample with variable sampling fraction. For total acreage, on the other hand,
the sample with variable sampling fraction is about 50 per cent. more precise
than the sample with uniform sampling fraction, the gain in precision being
greater for the separate domains than for the whole county.* For the mean
acreage the relative precision for the different domains is very variable. There
is a gain by use of a variable sampling fraction for domain A but not for domains
B and C. The random sample is always less precise than the stratified sample
with uniform sampling fraction, but the gain due to stratification is very variable
for the different measures and different domains. These results are, of course,
what would be expected from the nature of the variances.

TABLE 9.5.b: RELATIVE PRECISION (PER CENT.) OF DIFFERENT TYPES OF SAMPLE

                                      A      B      C    Whole county
u.s.f./v.s.f.     No. of farms      152    136    132         -
                  Total acreage      74     65     62        84
                  Mean acreage       77    101     95        84

Random/u.s.f.     No. of farms       9?     98     99         -
                  Total acreage      98     71     88        21
                  Mean acreage       82     60     80        21

(? = figure illegible.)

The relative efficiencies will be somewhat nearer unity than the relative 
precisions. A simple method of calculating them is described in Section 10.12. 


9.6 A further example of the critical analysis of survey data: factors at two levels


The following example of a pilot investigation of the effects of various
factors on family size is due to N. Keyfitz (1953, C′).† The investigation is of
general interest, as it shows how the simultaneous effects of a number of
quantitative factors can be studied by treating them as if they were qualitative
factors each at two levels. It also provides an illustration of the possibilities
of carrying out analyses of this kind on a small sample of census material,
analyses which would be quite intractable if the whole of the material were
included.

* This is less than the relative efficiency reported in Section 5.21 because the smallest
size-group has been omitted and there are no farms in the largest size-group in this
county.

† This was first presented at a lecture given by Dr. Keyfitz at the London School of
Economics. I am greatly indebted to him for providing me with details of the
investigation in advance of its appearance in published form. For a fuller discussion the
published paper should be consulted.

Table 9.6.a shows the average number of children ever born per family
and the numbers of families in a small sample from 16 counties in the Province
of Quebec.

TABLE 9.6.a: 1941 CENSUS OF CANADA: AVERAGE NUMBERS OF CHILDREN
AND NUMBERS OF FAMILIES IN A SMALL SAMPLE CLASSIFIED IN SIX WAYS

Present age                          45-54                   55-74
Age at marriage                 15-19      20-24        15-19      20-24
Years of schooling             0-6   7+   0-6   7+     0-6   7+   0-6   7+

Average number of children
Low income, French area:
  Far from city                9.4  10.7  10.3   ?.8   10.1  14.5  10.4   9.8
  Near city                    7.4  12.9   8.3   6.7   10.0  11.0   7.6   8.6
Low income, mixed area:
  Far from city               12.9  10.9   8.9   9.8    8.3  12.8   8.4   9.6
  Near city                    ?.7  11.3   9.4   7.1    9.0   9.9   8.6   8.6
High income, French area:
  Far from city               1?.9  12.9  10.6   9.8   12.1  12.5   9.0  11.8
  Near city                    9.3   8.7   7.1  10.3   10.8  13.2  10.9   9.9
High income, mixed area:
  Far from city               10.9  14.3   9.4  11.2   10.6  12.0   9.9   9.0
  Near city                   10.9  12.2   7.6   8.8   10.?  11.0   8.6   8.4

Number of families
Low income, French area:
  Far from city                 15    14    35    20     18     6    34    12
  Near city                      5     8    10    37      9     8    15    22
Low income, mixed area:
  Far from city                  ?     ?     ?     ?      ?     ?     ?     ?
  Near city                      ?     ?    14    49     12     8    17    29
High income, French area:
  Far from city                  ?     ?     ?     ?      ?     ?     ?     ?
  Near city                      ?     ?     ?    28     14    18    14    30
High income, mixed area:
  Far from city                  ?     ?     ?     ?      ?     ?     ?     ?
  Near city                      ?     ?     ?     ?      ?     ?     ?     ?

(? = figure illegible.)

The data are taken from the census schedules of the 1941 census
of Canada. 1,056 Roman Catholic, French-speaking farming families of French
origin are included. The data are classified in a six-way classification according
to all combinations of:

                 Wife's
Present      Age at       Years of       Farm        Relation     Type of
age          marriage     schooling      income      to city      area

45-54        15-19        0-6            Low         Far          French
55-74        20-24        7+             High        Near         Mixed
The classifications for income and relation to city refer to counties, and not 
to individual families. (Data for incomes could not be obtained for separate 
families without excessive labour.) The classification for area refers to sub- 
districts. All six classifications refer to quantitative factors. They have been 
converted to qualitative factors, each at two levels, by grouping the data ; the 
groupings chosen exclude entirely extreme values of some of the factors. 

Schematically the data now correspond to the results of a factorial experi- 
ment with 6 factors each at two levels. They differ, however, from experimental 
data in that the number of families, and therefore the accuracy, varies from 
cell to cell. Moreover since the classifications do not represent imposed ex- 
perimental treatments the conclusions are subject to the qualifications set out 
in Section 5.23. In this case also there is the further qualification that the 
last three classifications may be affected by other factors which are common 
to counties or sub-districts. 

In order to estimate the average effect of each factor separately, freed from
the effects of the other factors, the method of weighted means of differences
of sub-class means described in Section 5.23 (3) can be used. For each factor
there are 32 pairs of values which differ in the factor in question and are the same
for all the other factors. For relation to city, for example, the first pair of cells
gives the difference (near − far) 7.4 − 9.4 = −2.0, with weight 1/(1/15 + 1/5)
= 3.75. The weighted mean of all the 32 differences for this factor is found
to be −1.28, with total weight 234.* A pooled estimate of the variance within
cells of the number of children per family can be calculated from the numbers
of children in the separate families (not reproduced here). The value obtained
is 18.15. The standard error of the above estimate is therefore

$$\sqrt{18.15/234} = \pm 0.28.$$
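
The weighting rule can be expressed compactly in code. The sketch below assumes only what the text states: each difference between two cell means has weight 1/(1/n + 1/n'), and the standard error of the weighted mean is the square root of the pooled within-cell variance divided by the total weight. The function names are introduced here for illustration, and only the first pair of cells and the quoted totals are used.

```python
import math

def weighted_mean_difference(pairs):
    """pairs: (mean_hi, n_hi, mean_lo, n_lo) for cells differing in one factor.

    Returns the weighted mean of the differences mean_hi - mean_lo and the
    total weight, each difference having weight 1 / (1/n_hi + 1/n_lo).
    """
    total_weight = 0.0
    weighted_sum = 0.0
    for mean_hi, n_hi, mean_lo, n_lo in pairs:
        weight = 1.0 / (1.0 / n_hi + 1.0 / n_lo)
        total_weight += weight
        weighted_sum += weight * (mean_hi - mean_lo)
    return weighted_sum / total_weight, total_weight

def standard_error(within_cell_variance, total_weight):
    """Standard error of a weighted mean difference."""
    return math.sqrt(within_cell_variance / total_weight)

# First (near, far) pair quoted in the text: means 7.4 and 9.4, n = 5 and 15.
diff, weight = weighted_mean_difference([(7.4, 5, 9.4, 15)])
print(round(diff, 2), round(weight, 2))

# Standard error for all 32 pairs, using the totals quoted in the text.
print(round(standard_error(18.15, 234), 2))
```

Passing all 32 pairs to `weighted_mean_difference` would give the weighted mean and total weight used in the text.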

The effects of the other factors may be calculated in a similar manner. 
The full results are shown in Table 9.6.b. A plus sign in each case indicates 
that families at the higher (or indicated) level of the factor have the larger num- 
ber of children. It will be seen that four of the six factors show significant 
effects. 

* The form of these computations is set out in greater detail in Section 9.7.

TABLE 9.6.b: EFFECTS OF THE SIX FACTORS

                                      Eliminating          Ignoring
                                      other factors        other factors
Present age                           +0.38 ± 0.27         +0.30 ± 0.28
Age at marriage                       -1.77 ± 0.28         -2.02 ± 0.28
Years of schooling                    +0.72 ± 0.28         +0.16 ± 0.28
Income                                +0.90 ± 0.28         +1.20 ± 0.27
Relation to city (near - far)         -1.28 ± 0.28         -1.58 ± 0.27
Type of area (mixed - French)         -0.15 ± 0.28         -0.74 ± 0.28


The overall average effect of each factor, ignoring the other factors, is also 
shown in Table 9.6.b for comparison. The greatest differences are in years of 
schooling and type of area. It should be noted that the estimates obtained by 
eliminating the effects of other factors do not necessarily give the best estimates 
of the total effects of the factors. If, for example, age at marriage tends to be 
increased by longer schooling the total effect of schooling may be small, although 
amongst women married at the same age those with longer schooling may be 
more fertile. This question is discussed at greater length in Section 9.11. 

The above analysis is fully appropriate only when there are no marked 
differences (relative to the variability of the data) in the effects of each factor 
at different levels of the other factors. When there are such differences the 
effects are no longer additive and the factors are said to interact. When there 
are interactions then not only the estimates but the true values of the average 
effects will depend on the weights which are employed in arriving at the mean. 
It should be recognised that even in this case it is not incorrect to use the weights 
which give the most accurate estimates, but the quantities estimated are then 
in part determined by the actual (known) weights employed, which are them- 
selves dependent in part on the chance fluctuations of sampling which determine 
the numbers in the various cells. When the interactions are substantial,
therefore, and comparisons between the estimates derived from different
groups of the population are required, it may be better to obtain estimates
using weights based on some standardised proportions in the different cells,
since if this is not done the comparisons will contain components of interaction
from which we may wish to free them. If the actual frequencies in the different
cells are nearly proportionate, then weights based on proportionate frequencies
may be taken as in Method 2 of Section 5.23.

In general, interactions are only likely to be large if the factors concerned
produce large average effects. Their existence and magnitude can be examined
by the same method as that illustrated above for the average effects. To
investigate the interaction of relation to city and schooling, for example, we
take sets of four cells which differ in these two factors but have the same levels
for all other factors. Each 2 × 2 group of cells in the table is such a set. For
the top left-hand group we have

Effect of schooling
    Far from city             10.7 − 9.4 = +1.3
    Near city                 12.9 − 7.4 = +5.5

    Difference (near − far)                +4.2

The weight of this difference is 1/(1/15 + 1/14 + 1/5 + 1/8) = 2.16. There
are 16 such differences, and the weighted mean is found to be −0.40 with
total weight 53.37, giving a standard error of 0.58. For the reason given below
it is customary to define the interaction as one-half the difference of the effects.
The estimate of the interaction is therefore −0.20 ± 0.29.
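
The same computation can be sketched in code for one set of four cells. The cell means and numbers are those of the top left-hand group quoted above; the totals for all 16 sets (weighted mean −0.40, total weight 53.37, within-cell variance 18.15) are taken from the text as given rather than recomputed. The function name is an assumption made for illustration.

```python
import math

def interaction_component(far_lo, far_hi, near_lo, near_hi):
    """Each argument is (mean, n) for one cell of a 2 x 2 set.

    Returns the difference between the schooling effect near the city and
    far from the city, together with its weight 1 / sum(1/n).
    """
    effect_far = far_hi[0] - far_lo[0]
    effect_near = near_hi[0] - near_lo[0]
    cells = (far_lo, far_hi, near_lo, near_hi)
    weight = 1.0 / sum(1.0 / n for _, n in cells)
    return effect_near - effect_far, weight

# Top left-hand group of Table 9.6.a as quoted in the text.
difference, weight = interaction_component(
    far_lo=(9.4, 15), far_hi=(10.7, 14), near_lo=(7.4, 5), near_hi=(12.9, 8))
print(round(difference, 1), round(weight, 2))

# Totals quoted in the text for all 16 sets.
mean_difference, total_weight, within_variance = -0.40, 53.37, 18.15
se_difference = math.sqrt(within_variance / total_weight)
interaction = mean_difference / 2      # one-half the difference of effects
se_interaction = se_difference / 2
print(round(interaction, 2), round(se_interaction, 2))
```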
In this example only two of the 15 two-factor interactions are substantially
larger than their standard errors. These are

    Age at marriage × schooling        −0.67 ± 0.29
    Present age × type of area         −0.73 ± 0.29
The second may reasonably be regarded as a chance effect since neither present 
age nor type of area produce a significant average effect. The first, however, 
is between two factors both of which produce significant average effects, and 
may be taken to indicate that the effect of each factor is influenced by the 
level of the other factor. For two factors a and b which have average effects
A and B and interaction A.B the effect of a at the lower level of b will be

    A − A.B

and at the higher level will be

    A + A.B

and similarly for b. Thus the effect of schooling at the lower age of marriage
will be

    +0.72 − (−0.67) = +1.39

and at the higher age will be

    +0.72 + (−0.67) = +0.05.

The reason for the factor ½ in the interactions will now be apparent; it makes
the average effects and interactions directly additive.


The results can also be put in the form of a 2 × 2 table. If M is an estimate
of the mean, the cells of the table will be

              b-                          b+
    a-   M - ½A - ½B + ½A.B        M - ½A + ½B - ½A.B
    a+   M + ½A - ½B - ½A.B        M + ½A + ½B + ½A.B

In this case, taking the unweighted mean, 10.13, of the cell values for M we
obtain

                               Schooling
                             0-6        7+
    Age at      15-19       10.32     11.71
    marriage    20-24        9.22      9.27

Without knowing more about the data it is difficult to offer an explanation of
this apparent difference in effect of schooling.
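
The construction of the 2 × 2 table from the mean, the two average effects and the interaction can be sketched as follows. The coding of the two levels as −1 and +1 and the function name are conveniences introduced here; the numerical values are those given in the text.

```python
def two_by_two(mean, effect_a, effect_b, interaction_ab):
    """Expected cell values for factors a and b, keyed by (level_a, level_b).

    Levels are coded -1 (lower) and +1 (higher); the interaction is defined
    as one-half the difference of the effects, as in the text.
    """
    return {
        (la, lb): mean
        + 0.5 * la * effect_a
        + 0.5 * lb * effect_b
        + 0.5 * la * lb * interaction_ab
        for la in (-1, 1)
        for lb in (-1, 1)
    }

# a = age at marriage, b = years of schooling (values from the text).
cells = two_by_two(mean=10.13, effect_a=-1.77, effect_b=0.72,
                   interaction_ab=-0.67)
for (la, lb), value in sorted(cells.items()):
    print(f"age level {la:+d}, schooling level {lb:+d}: {value:.2f}")
```

The difference between the two schooling columns in each row reproduces the effects of schooling at the lower and higher ages of marriage.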

It may be noted that if the interactions can be assumed to be negligible 
and there are more than two factors, the above method will not provide quite 
the most accurate estimates, even when all factors are at two levels only. An 
efficient process of estimation is then provided by an extension of the method 
of fitting constants, exemplified in Section 5.24. The gain in efficiency is, 
however, likely to be small unless the data are very fragmentary. In the present 
example the gains for the six factors ranged from 2 per cent. to 12 per cent. 
Such gains would not justify the additional computational labour, which is 
better devoted to extending the scope of the investigation in other directions. 

The procedure of grouping and working with factors at two levels provides
an alternative to multiple regression analysis (Section 9.9). In data of this
complexity, regression analysis would be exceedingly laborious, requiring the
evaluation of 28 sums of squares and products and the inversion of a 6 × 6
matrix. Moreover the regression technique does not readily lend itself to the
investigation of the existence of interactions.
It should be recognised, however, that the effects estimated by the above
procedure do not give direct estimates of the regression coefficients (i.e. the
change per unit change of the factor). Thus the difference between the two
age-at-marriage groups, 15-19 and 20-24, is −1.77. We cannot assume that
this represents the change resulting from a change of marriage age from 17.5
to 22.5, since the marriages will not be evenly distributed within the groups;
there will be very few marriages, for example, at age 15. Instead an estimate
of the change per year can be made by dividing −1.77 by the difference in
mean marriage age for the two age-at-marriage groups taken over the whole
of the data. A more accurate procedure is to calculate the difference d in mean
marriage age for each pair of cells which goes to make up the weighted mean
difference −1.77, and take a weighted mean of these differences (using the same
weights as in the main calculation). This will provide the appropriate divisor
for estimating the rate of change. If the relationship is really linear a more
accurate estimate will be obtained by taking new weights equal to d times the
old weights for both means, but the differences between the different d are
not likely to be sufficiently large for this to be worth while.

The efficiency of estimates of regression coefficients based on data grouped
into two classes is not unduly low. If all variates are normally distributed and
the divisions between the groups are located at the mid-point of each distribution
an efficiency of 64 per cent. is attained when the regressions are truly linear.*
There will be some additional loss if the divisions are not taken at the mid-points.
In practice the effective efficiency is likely to be greater than normal theory
would indicate, since the occasions on which regressions are truly linear are
somewhat rare. Moreover gross errors in the independent variate produce
less disturbance in data grouped into two classes. When a considerable amount
of data is available, as in the present example, there is no doubt that the gains
from the more detailed analysis which are possible with the grouping method
will far outweigh the theoretical gains in efficiency of the regression method.

* For a single independent variate a greater efficiency will be attained if the central
portion of the distribution is rejected. The maximum efficiency with this procedure is
81 per cent. when the central 46 per cent. of the distribution is omitted. This procedure
is not suitable for more than one independent variate, however, since the central values
of the different variates will not appertain to the same units.
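
The figure of 64 per cent. can be checked from normal theory. The sketch below rests on an assumption not spelled out in the text: that the slope is estimated from the difference between the means of an upper and a lower group, each containing a fraction p of a normally distributed independent variate, in which case the efficiency relative to least squares is 2φ(z)²/p, where z cuts off the upper fraction p and φ is the normal ordinate. The function name is illustrative.

```python
from statistics import NormalDist

def two_group_efficiency(tail_fraction):
    """Efficiency, relative to least squares, of a slope estimated from the
    means of the upper and lower `tail_fraction` of a normally distributed
    independent variate, the true regression being linear."""
    std_normal = NormalDist()
    cut = std_normal.inv_cdf(1.0 - tail_fraction)
    return 2.0 * std_normal.pdf(cut) ** 2 / tail_fraction

# Division at the mid-point: each group holds one-half of the distribution.
print(round(100 * two_group_efficiency(0.50)))

# Central 46 per cent. rejected: each group holds 27 per cent.
print(round(100 * two_group_efficiency(0.27)))
```

Under the same assumption the efficiency is greatest when each group holds about 27 per cent. of the distribution, that is, when about the central 46 per cent. is rejected.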


9.7 Qualitative data: separation of the effects of different factors 


When the effects of a number of factors on a qualitative variate are under 
consideration the approach by means of regression and fitting constants requires 
modification. The difficulty arises from the fact that when proportions are 
analysed different factors cannot be expected to produce a strictly additive 
effect, since the proportions themselves can only lie in the range 0-1. The 
effects may often be made more nearly additive by re-scaling the proportions, 
using some transformation which gives a transformed range from −∞ to +∞.
The two transformations which are commonly used for this purpose are the 
logit and probit transformations. 

The logit transformation is given by the equation

$$y = \log_e (p/q)$$

and has been tabulated by Finney in Statistical Method in Biological Assay
and elsewhere, where 5 is conventionally added to the y value to avoid negative
values. Alternatively the r, z transformation tabulated in Statistical Tables
can be used, taking r = 2p − 1 and giving z the same sign as r. We then have

$$z = \tfrac{1}{2} \log_e (p/q),$$

i.e. half-logits.

The logit transformation has the property that for equal intervals on the
logit scale the odds (p : q) are changed by the same factor. Since
log_e 2 = 0.69315 a change of this amount in the logit represents a doubling
of the odds. Thus odds of 4 : 1 will be changed to 8 : 1, corresponding to a
change of p from 0.8 to 0.889. If the logit scale (without the addition of 5)
is altered to logits (base 2) by multiplication by 1.443 (= 1/0.69315), 0 will
represent even odds, +1 odds of 2 : 1, +2 odds of 4 : 1, −1 odds of 1 : 2,
etc., and a unit effect will correspond to a factor of 2 in the odds. This enables
the results to be presented in a form in which their meaning is easily understood.
If the r, z transformation is used the multiplier is 2.885.
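
These relations are easily verified numerically. The following sketch computes the logit, the half-logit z, and the logit to base 2 for a few proportions; the function names are introduced here for illustration, and the conventional addition of 5 is omitted.

```python
import math

def logit(p):
    """Natural-log logit, log_e(p/q) with q = 1 - p."""
    return math.log(p / (1.0 - p))

def half_logit(p):
    """z of the r, z transformation: one-half the logit."""
    return 0.5 * logit(p)

def logit_base2(p):
    """Logit rescaled so that a unit change doubles the odds."""
    return logit(p) / math.log(2.0)

for p in (1 / 3, 0.5, 2 / 3, 0.8, 8 / 9):
    print(f"p = {p:.3f}  logit = {logit(p):+.3f}  "
          f"z = {half_logit(p):+.3f}  base-2 logit = {logit_base2(p):+.3f}")
```

The last two rows correspond to odds of 4 : 1 and 8 : 1, whose base-2 logits differ by one unit.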

The probit transformation is the normal deviate corresponding to the
integral (taken from the left) of the normal curve with unit standard deviation
(see Fig. 7.3). It is therefore given by the inverse of Table A2, taking P = 2p
and giving the deviate (probit) a negative sign when p < 0.5, and taking
P = 2 (1 − p) and giving the deviate (probit) a positive sign when p > 0.5.
To avoid negative values 5 is conventionally added to the probit so obtained.
Thus for a probit of 4.5, P = 0.6171 and p = 0.3086. Full tables are given
in Statistical Tables and by Finney (loc. cit.).
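
Where tables are not to hand the probit can be computed directly from the normal integral. The sketch below uses the normal distribution of the Python standard library in place of Table A2; the function names are introduced here for illustration.

```python
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # normal curve with unit standard deviation

def probit(p):
    """Normal deviate for the proportion p, with the conventional 5 added."""
    return 5.0 + _STD_NORMAL.inv_cdf(p)

def proportion_from_probit(y):
    """Inverse of `probit`: the proportion corresponding to a probit y."""
    return _STD_NORMAL.cdf(y - 5.0)

print(round(proportion_from_probit(4.5), 4))
print(round(probit(0.3086), 2))
```

The proportion obtained for a probit of 4.5 agrees with the value given in the text to within rounding in the fourth decimal place.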
There is no reason to expect that the effects of factors will be exactly additive
in the transformed data. As in quantitative data there may be interactions, 
which will indicate departures from the additive law. Moreover if the effects 
are additive for one transformation they will not be exactly additive for any 
other. In extensive work it may under certain circumstances be worth in- 
vestigating what transformation produces the closest approach to additivity. 
The logit and probit transformations, however, are the type of transformation 
which may under many conditions be expected to give approximately additive 
effects. For any but the most precise data they are sufficiently similar for it 
to be immaterial which is used. 

As an example of the type of analysis involved we may take the data given 
by Lombard and Doering (1947, A’) obtained in the course of a survey on cancer 
knowledge. (See also Dyke and Patterson (1952, A’), where the data are 
analysed by means of logits, using a somewhat different method from that given 
here.) The object of the survey was to determine the influence of various 
possible sources of information on the knowledge of cancer. The individuals 
included in the survey were classified according to whether they read newspapers, 
listened to radio, etc., and also according to whether their knowledge of cancer 
was good or poor. The data so obtained are shown in the second and third 
columns of Table 9.7.a, where the four factors are represented by 

a (newspaper reading) 
b (radio) 

c (solid reading) 

d (lectures). 


The corresponding z values (half-logits) are shown in column 4. (These were
taken from Dyke and Patterson's paper and are based on values of p to two
decimal places.)

TABLE 9.7.a: LOMBARD AND DOERING'S DATA ON CANCER KNOWLEDGE

                                        Weight     Variance
             n        p        z         4npq       1/4npq
(1)         477     .176     -.78        277        .0036
a           231     .325     -.37        202        .0050
b            63     .206     -.68         41        .0244
ab           94     .372     -.27         88        .0114
c           150     .447     -.11        148        .0068
ac          378     .532     +.06        376        .0027
bc           32     .500      .00         32        .0312
abc         169     .604     +.21        162        .0062
d            12     .167     -.81          7        .1429
ad           13     .538     +.08         13        .0769
bd            7     .571     +.14          7        .1429
abd          12     .667     +.34         11        .0909
cd           11     .273     -.48          9        .1111
acd          45     .600     +.20         43        .0233
bcd           4     .250     -.55          3        .3333
abcd         31     .742     +.52         24        .0417

The method of Section 9.6 for evaluating the effects of the separate factors
can now be followed. Instead of a variance proportional to 1/n, however, a
z-value based on n individuals will have a variance of 1/4pqn. These variances,
calculated from the observed p's, are shown in column 5. The effects of the
various factors (and, if required, their interactions) can now be estimated from
the weighted mean of the individual differences. The calculations for A are
shown in Table 9.7.b. The first two lines, for instance, give a difference of
−0.37 − (−0.78) = 0.41 with variance 0.0050 + 0.0036 = 0.0086. The
weighted mean, +0.325, has an estimated standard error of √(1/295).


TABLE 9.7.b: CALCULATION OF THE A EFFECT

                        z        Variance     Weight
a - (1)               +.41        .0086        116
ab - b                +.41        .0358         28
ac - c                +.17        .0095        105
abc - bc              +.21        .0374         27
ad - d                +.89        .2198          5
abd - bd              +.20        .2338          4
acd - cd              +.68        .1344          7
abcd - bcd           +1.07        .3750          3

                    +0.325       1.0543        295

The values obtained are:

            z-values              logits (base 2)
    A    +0.325 ± 0.058        +0.938 ± 0.167
    B    +0.147 ± 0.062        +0.424 ± 0.179
    C    +0.500 ± 0.055        +1.442 ± 0.159
    D    +0.223 ± 0.098        +0.643 ± 0.283
The effects of all four factors are significant. Their relative magnitude can
be approximately examined by means of the standard errors shown. These
indicate, for example, that solid reading (c) produces a larger effect than any
other factor. The estimates are, however, not independent, owing to the
inequality in the numbers in the different sub-classes. An exact test of the
difference in effect of a and c, for example, can be obtained by testing the
weighted mean of the differences c − a over all combinations of the other
factors. The calculations are similar to those shown in Table 9.7.b.
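
The calculation of Table 9.7.b can be repeated for all four factors in a few lines. The sketch below assumes the z values and variances of Table 9.7.a as read from the table (the variance for abc being taken as 0.0062, i.e. 1/162); the dictionary layout and function name are introduced here for illustration. Unrounded weights are used, so the results may differ from the printed values in the third decimal place.

```python
import math

# Half-logits z and variances 1/(4npq) from Table 9.7.a, keyed by the
# factors present in each cell ("" is the cell with none of the factors).
CELLS = {
    "": (-0.78, 0.0036), "a": (-0.37, 0.0050),
    "b": (-0.68, 0.0244), "ab": (-0.27, 0.0114),
    "c": (-0.11, 0.0068), "ac": (0.06, 0.0027),
    "bc": (0.00, 0.0312), "abc": (0.21, 0.0062),
    "d": (-0.81, 0.1429), "ad": (0.08, 0.0769),
    "bd": (0.14, 0.1429), "abd": (0.34, 0.0909),
    "cd": (-0.48, 0.1111), "acd": (0.20, 0.0233),
    "bcd": (-0.55, 0.3333), "abcd": (0.52, 0.0417),
}

def factor_effect(cells, factor):
    """Weighted mean of the differences (with factor) - (without factor),
    and its standard error sqrt(1 / total weight)."""
    total_weight = 0.0
    weighted_sum = 0.0
    for label, (z_without, var_without) in cells.items():
        if factor in label:
            continue
        with_label = "".join(sorted(label + factor))
        z_with, var_with = cells[with_label]
        weight = 1.0 / (var_with + var_without)
        total_weight += weight
        weighted_sum += weight * (z_with - z_without)
    return weighted_sum / total_weight, math.sqrt(1.0 / total_weight)

TO_BASE2 = 2.0 / math.log(2.0)  # the multiplier 2.885 for half-logits

for factor in "abcd":
    effect, se = factor_effect(CELLS, factor)
    print(f"{factor.upper()}: z = {effect:+.3f} +/- {se:.3f}   "
          f"logits (base 2) = {effect * TO_BASE2:+.3f}")
```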

In addition to estimates of the effects an estimate of the mean z may be
required. This can be obtained from the unweighted mean of all the z's.
If the numbers in some of the cells are small, however, it is better to take the
weighted mean of the z's (with weights 4npq). We here have S (wz)/S (w) =
−286.9/1443. This weighted mean requires adjustment if it is to represent the
mean of a population with equal numbers in all cells. If $S_{a+}(w)$ is the sum of
the weights of the cells with a, and $S_{a-}(w)$ is the sum for the cells without a
the correction to S (wz) due to the a effect is $-\tfrac{1}{2} A\,\{S_{a+}(w) - S_{a-}(w)\}$.
Here we have $S_{a+}(w) = 919$ and $S_{a-}(w) = 524$. The correction is therefore
−½ × 0.325 {919 − 524} = −64.2. The corrections for the b, c and d effects
are calculated similarly. Hence the corrected value of the mean is

    m = {−286.9 − 64.2 + 52.0 − 37.8 + 134.8}/1443 = −0.140.


The expected value of z in any cell can now be calculated. That for abcd, for
example, is

    m + ½A + ½B + ½C + ½D = −0.140 + 0.162 + 0.074 + 0.250 + 0.112
                          = +0.458

corresponding to a value of p of 0.714. For the nil combination p = 0.186.
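
The corrected mean and the expected proportions can be retraced as follows. The sketch assumes the z values and weights 4npq of Table 9.7.a and the four effects quoted above; the function names and the dictionary layout are introduced here for illustration. The conversion from z back to p uses the inverse of z = ½ log_e (p/q).

```python
import math

# z and weights 4npq from Table 9.7.a; "" is the nil combination.
Z = {"": -0.78, "a": -0.37, "b": -0.68, "ab": -0.27, "c": -0.11, "ac": 0.06,
     "bc": 0.00, "abc": 0.21, "d": -0.81, "ad": 0.08, "bd": 0.14, "abd": 0.34,
     "cd": -0.48, "acd": 0.20, "bcd": -0.55, "abcd": 0.52}
W = {"": 277, "a": 202, "b": 41, "ab": 88, "c": 148, "ac": 376,
     "bc": 32, "abc": 162, "d": 7, "ad": 13, "bd": 7, "abd": 11,
     "cd": 9, "acd": 43, "bcd": 3, "abcd": 24}
EFFECTS = {"a": 0.325, "b": 0.147, "c": 0.500, "d": 0.223}

def corrected_mean(z, w, effects):
    """Weighted mean of z adjusted to represent equal numbers in all cells."""
    total = sum(w[cell] * z[cell] for cell in z)
    for factor, effect in effects.items():
        with_factor = sum(wt for cell, wt in w.items() if factor in cell)
        without_factor = sum(wt for cell, wt in w.items() if factor not in cell)
        total -= 0.5 * effect * (with_factor - without_factor)
    return total / sum(w.values())

def expected_p(mean, effects, cell):
    """Expected proportion in a cell when effects are additive in z."""
    z = mean + sum(0.5 * e if f in cell else -0.5 * e
                   for f, e in effects.items())
    return 1.0 / (1.0 + math.exp(-2.0 * z))

m = corrected_mean(Z, W, EFFECTS)
p_abcd = expected_p(m, EFFECTS, "abcd")
p_nil = expected_p(m, EFFECTS, "")
print(f"m = {m:+.3f}")
print(f"p(abcd) = {p_abcd:.3f}")
print(f"p(nil)  = {p_nil:.3f}")
```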

The above calculation is approximate, since the variances and weights
have been estimated from the observed values of p. If any of the p's are 0 or 1
the process breaks down entirely, since z is ±∞ and its variance is also infinite.
If the 0 or 1 values arise from cells with small n they can be rejected (i.e. given
zero weight) without serious error. If, however, there are cells with large n
which have values of p equal to or very near 0 or 1 a further adjustment will
be needed. The procedure is similar to that which has become familiar in
probit analysis. The necessary formulae for both logits and probits and an
example and tables for the probit case are given in Statistical Tables. More
details and the necessary tables for the logit case will be found in Finney's
Statistical Method in Biological Assay.

The important point to notice about the above method of analysis is that
it provides quantitative estimates of the effects of the different factors. It
therefore differs radically from the classical approach to the analysis of qualitative
data by means of $\chi^2$. The $\chi^2$ analysis provides tests of significance, but not
estimates of the magnitude of the effects. The application of $\chi^2$ to multiple
contingency tables of which the above is an example is, moreover, very
complicated.


9.8 Use of ratios and regressions in investigational work


Ratios and regressions have many uses in investigational work. Whenever
the nature of a pair of variates is such that the ratio between them may be
expected to be relatively constant then the replacement of the variates by some
estimate of the ratio is likely to simplify considerably the interpretation of the
data. The choice of estimate depends on the nature of the variability. If x
and y do not vary very widely then the unweighted means of the individual
ratios r (= y/x) calculated from the pairs of values are likely to give the most
accurate results. Under such circumstances the biases of the unweighted means
are usually of relatively little importance in comparative work. Once the
individual ratios have been calculated the unweighted means are less trouble
to compute than any form of weighted mean, particularly when many alternative
groupings of the data are required. If on the other hand x and y vary widely
(and particularly if there are some very small values) the estimates r = S(y)/S(x)
may be subject to smaller errors, as well as being unbiased.

Several examples where ratios have proved of use have already been given.
In Example 6.19 both types of estimate were examined.
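The two types of estimate can be contrasted in a short Python sketch. The data are hypothetical and serve only to show the two calculations side by side; the function names are not taken from the text.

```python
def mean_of_ratios(x, y):
    """Unweighted mean of the individual ratios r = y/x."""
    ratios = [yi / xi for xi, yi in zip(x, y)]
    return sum(ratios) / len(ratios)


def ratio_of_totals(x, y):
    """Ratio of the totals, r = S(y)/S(x)."""
    return sum(y) / sum(x)


# Hypothetical pairs of values in which x and y do not vary very widely.
x = [10.0, 12.0, 11.0, 9.0]
y = [20.0, 25.0, 21.0, 19.0]

print(mean_of_ratios(x, y), ratio_of_totals(x, y))
```

When y is exactly proportional to x the two estimates coincide; they diverge as the individual ratios vary, and the divergence is most marked when some values of x are very small.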

It is important to recognise that the use of a ratio does not imply that its 
value is necessarily constant over the range of x. If it appears desirable the 
relation between r and x (or r and y) can be examined, e.g. by the use of a 
regression of r on x (or r on y). It is sometimes objected that this procedure
is inadmissible since random errors in x will give rise to a negative correlation 
between r and x. This objection is, however, only valid if x and y are in fact 
subject to random errors which are independent. In a study of the relation 
of earnings to factory size in an industry, for example, the size, as represented 
by the number of workers, will usually be correctly ascertained, and the relation 
between the average earnings per worker and size will therefore not have any 
spurious component of correlation ; by taking earnings per worker and size 
instead of total wage-bill and size as the variates for analysis we obtain the 
data in a form which is considerably easier to study. 

When the value of the ratio changes considerably over the range of x it 
may be more appropriate to use the regression of y on x. Either a linear or a 
curved regression may be used. The calculation of a linear regression has been 
described in Section 6.12. 

In investigational work we may be interested not only in the relation of 
y to x over the whole population, but also in differences in this relationship 
for different domains of study. In such cases ratios or regression lines can be 
calculated for each domain separately. For comparative purposes slight 
departure from the assumed law is often of little consequence. Thus provided 
the mean of x is similar for the different domains linear regressions may be used 
when some degree of curvature in the lines is apparent from the data. If, 
however, the means of x differ considerably the slopes of the linear regressions 
will differ, although the whole of the data may in fact be adequately described 
by a single curved regression line. 

A full discussion of the use and interpretation of regression analysis is not
possible here. One or two points may, however, be mentioned. 

If regressions are calculated for different domains of study which are, for
example, parts of a random sample, or strata of a stratified sample, we shall
obtain regression equations of the form:

Domain A: $Y_a = \bar{y}_a + b_a (x - \bar{x}_a)$
Domain B: $Y_b = \bar{y}_b + b_b (x - \bar{x}_b)$

etc. To enable these equations to be compared directly $\bar{x}_a$, $\bar{x}_b$, etc., must be
replaced by some standard value $x_0$ which should be chosen conveniently near
the general mean. The equations will then become

$Y_a = y_{a0} + b_a (x - x_0)$
$Y_b = y_{b0} + b_b (x - x_0)$

etc., with $y_{a0} = \bar{y}_a + b_a (x_0 - \bar{x}_a)$, etc. The formula for the error of $y_{a0}$, etc.,
has been given in Section 7.12. The differences between the standardised
values $y_{a0}$, etc., will represent the differences between the domain y’s for the
standard value $x_0$ of x. If $b_a$, $b_b$, etc., differ substantially, the differences
between $y_{a0}$, $y_{b0}$, etc., will depend on the value chosen for $x_0$. If there is no
evidence of differences between the b’s then a mean b can be taken (in which
case all the regression lines will be parallel). The mean b can be calculated
from the formula of Section 6.13. The question of whether the b’s differ can
be examined by means of the standard errors of $b_a$, $b_b$, etc., calculated from
the formula of Section 7.12. The analysis of variance can also be used in this
connection, but the procedure is rather complicated for those not familiar
with the technique. (See, for example, Quenouille, Associated Measurements.)
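The re-expression of each domain's line about a common standard value of x amounts to a single evaluation per domain. The sketch below uses hypothetical domain means and slopes; the names are illustrative and not part of the text.

```python
def standardised_value(y_mean, b, x_mean, x0):
    """Value at x0 of the line Y = y_mean + b (x - x_mean)."""
    return y_mean + b * (x0 - x_mean)


# Hypothetical domains: (mean of y, regression coefficient b, mean of x).
domains = {
    "A": (12.0, 0.50, 20.0),
    "B": (15.0, 0.45, 28.0),
}

x0 = 24.0  # standard value chosen near the general mean of x

standardised = {
    name: standardised_value(y_mean, b, x_mean, x0)
    for name, (y_mean, b, x_mean) in domains.items()
}
difference = standardised["A"] - standardised["B"]
print(standardised, difference)
```

Because the two slopes in this illustration differ, repeating the calculation with another value of x0 gives a different difference between the domains, which is the dependence on the choice of standard value noted above.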

A common tactical error made by those unfamiliar with regression work is
to take $x_0$ as zero, and re-write the regression equations in the apparently simpler
form

$Y_a = a_a + b_a x$
$Y_b = a_b + b_b x$

etc. Unless the means of x are near to zero these equations are unsuitable
for comparison, for although $a_a$, $a_b$, etc., give the estimated values of y for
x = 0 they are subject to large errors, both because of errors in $b_a$, $b_b$, etc.,
and because although the assumption that the regressions are linear may be
reasonably correct over the range of x actually covered, it may be by no means
correct if the range is extended to zero.

When comparing regression lines it is often useful to plot all the lines on 
the same graph. Relations which are implicit in the equations will then be 
immediately apparent. 

Regressions can easily be calculated from grouped data. If the data are
grouped for x only and the mean value of y is calculated for each group, the
plot of these values (apart from sampling errors and a small error introduced
by the grouping) represents the regression of y on x. With a large sample
this is quite a good way of examining the regression. At the same time the plot
will reveal whether the regression is truly linear. It is often advisable, however,
to make an exact calculation of the regression from the group means, rather than
to draw a line by eye, since it is difficult otherwise to make proper allowance
for the varying numbers of observations in the different groups. If $n_1$, $n_2$, . . .
represent the numbers in the groups, $x_1$, $x_2$, . . . the values of x for the mid-points
of the groups, and $\bar{y}_1$, $\bar{y}_2$, . . . the means of y, the formula for the
regression coefficient will be

$$b = \frac{S(n_r x_r \bar{y}_r) - n \bar{x} \bar{y}}{S(n_r x_r^2) - n \bar{x}^2}$$

where $n = S(n_r)$, $\bar{x} = S(n_r x_r)/n$, $\bar{y} = S(n_r \bar{y}_r)/n$.
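A minimal Python sketch of the calculation from group means follows. It weights each group by its number of observations, as the formula requires; the group figures are hypothetical and the function name is illustrative.

```python
def grouped_regression(counts, midpoints, group_means):
    """Regression of y on x from group counts n_r, mid-points x_r and means of y.

    Returns (b, x_bar, y_bar) so that the fitted line is
    Y = y_bar + b (x - x_bar).
    """
    n = sum(counts)
    x_bar = sum(c * x for c, x in zip(counts, midpoints)) / n
    y_bar = sum(c * y for c, y in zip(counts, group_means)) / n
    s_xy = sum(c * x * y for c, x, y in zip(counts, midpoints, group_means)) - n * x_bar * y_bar
    s_xx = sum(c * x * x for c, x in zip(counts, midpoints)) - n * x_bar ** 2
    return s_xy / s_xx, x_bar, y_bar


# Hypothetical grouped data: unequal numbers in the groups.
counts = [5, 20, 10]
midpoints = [1.0, 2.0, 3.0]
group_means = [3.2, 4.9, 7.1]

b, x_bar, y_bar = grouped_regression(counts, midpoints, group_means)
print(b, x_bar, y_bar)
```

A line drawn by eye through the three plotted means would tend to treat them as equally important, whereas here the middle group carries four times the weight of the first.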

With two variates there are two regression lines, that of y on x and that of
x on y. The data themselves will not tell us which, if either, of these regression
lines is appropriate for our purposes. If we wish to estimate the values of y
for individual units for which only the value of x is known, or the mean y
for a set of such units, then the regression of y on x will give the appropriate
estimation equation, provided the units can be regarded as further random samples
of one unit each from the population of which the data are a sample. This is the
justification for using the regression method in the manner outlined in Chapter 6
for improving the accuracy of a sample for which supplementary information
is available either from the whole population or from the first phase of a
two-phase sample. For estimation of this type, errors in x can be ignored in
calculating the regression. If x is subject to large errors the numerical value of
the regression coefficient will be reduced. We shall consequently estimate
the unobserved y to be closer to the mean of the already observed y than
would be the case if x were free from error. That this is as it should be is
obvious if we consider the limiting case where x is subject to such large errors
as to be worthless. In this case the mean value of the observed y provides
the best estimate of all unobserved y.

If, on the other hand, we are concerned with the estimation of the mean 
of the y’s of a new sample which, although possessing certain features in common 
with the original sample on which both x and y were measured, cannot be 
regarded as a random sample from the same population, the whole situation 
is altered. We then have to consider in much more detail the nature of the 
measured variates and the errors affecting them. This situation is discussed 
in more detail in Section 9.10. 

If the regressions are being used in investigational work we shall not usually
be concerned with problems of the estimation of further values of y from
observed values of x. We shall rather be concerned to evaluate the underlying
laws which govern the relationships between various variates. In this case
if y is believed to depend causally, in part at least, on x we shall normally
require the regression of y on x. This will give the relation between x and the
mean value of y for given x. The variation of the actual y’s about the mean
value of the y’s for a given x may then be attributed to the influence of other
variates on y and to random errors in y (of observation, etc.). If x is subject
to error a correction to the regression coefficient will in this case be required,
as is explained in Section 9.10.


9.9 Multiple regression 


Instead of taking a regression on a single variate it is possible to take a
multiple regression on two or more variates. The formulae will be given for
two variates; they can easily be extended to more variates when required.

To shorten the formulae we may write $S_{11}$ for $S(x_1 - \bar{x}_1)^2$, $S_{12}$ for
$S(x_1 - \bar{x}_1)(x_2 - \bar{x}_2)$, $S_{1y}$ for $S(x_1 - \bar{x}_1)(y - \bar{y})$, etc. If $x_1$ and $x_2$ are the
two independent variates the regression can be written in the form

$$Y = \bar{y} + b_1 (x_1 - \bar{x}_1) + b_2 (x_2 - \bar{x}_2) \qquad (9.9.a)$$

The regression coefficients $b_1$ and $b_2$ are given by the solution of the two
simultaneous linear equations (known as the normal equations):

$$b_1 S_{11} + b_2 S_{12} = S_{1y}$$
$$b_1 S_{12} + b_2 S_{22} = S_{2y} \qquad (9.9.b)$$

In order to obtain the standard errors of $b_1$ and $b_2$, or any linear function of
$b_1$ and $b_2$, it is necessary to perform an operation known as inverting the matrix
given by the coefficients of the b’s in the above equations. This is equivalent
to solving the two pairs of equations of which the left-hand sides are those of
the normal equations, but which have as numerical terms 1, 0 and 0, 1
respectively, instead of $S_{1y}$, $S_{2y}$. The values of $b_1$ and $b_2$ given by these equations are
commonly denoted by $c_{11}$, $c_{12}$, and $c_{21}$, $c_{22}$. The c’s form the inverse matrix,
which has diagonal symmetry, so that $c_{12} = c_{21}$. This property considerably
reduces the labour of inverting a large matrix. Methods of performing the
inversion expeditiously when there are a number of variates are described in
many statistical textbooks (e.g. Statistical Methods for Research Workers) and
will not be given here. Whatever the method followed the values obtained
should be substituted in the original equations to see that they are really
satisfied.

When the c’s have been obtained the b’s can be calculated from the formulae

$$b_1 = c_{11} S_{1y} + c_{12} S_{2y}$$
$$b_2 = c_{12} S_{1y} + c_{22} S_{2y} \qquad (9.9.c)$$

The sum of squares of the deviations from the regression line will be

$$Q = S(y - Y)^2 = S_{yy} - b_1 S_{1y} - b_2 S_{2y} \qquad (9.9.d)$$

Since two degrees of freedom have been absorbed by the regression line, and
one degree of freedom because deviations from the mean have been taken, the
total number of degrees of freedom remaining will be n − 3. The residual
variance of a single observation after fitting the regression is therefore

$$s^2 = Q/(n - 3)$$

We then have

$$V(b_1) = c_{11} s^2, \qquad V(b_2) = c_{22} s^2, \qquad \mathrm{cov}(b_1, b_2) = c_{12} s^2$$

Hence the variance of any linear function is given by

$$V(l_1 b_1 + l_2 b_2) = (l_1^2 c_{11} + 2 l_1 l_2 c_{12} + l_2^2 c_{22}) s^2 \qquad (9.9.e)$$

The reader should verify that if $b_2$ is omitted from the above formulae the
formulae already given in Sections 6.12 and 7.12 for a regression on a single
variate are obtained.
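For two variates the whole calculation, including the inverse matrix, can be written out directly. The Python sketch below follows formulae 9.9.b to 9.9.e; the data are hypothetical, the function names are illustrative, and the closed-form inverse used here applies only to the two-variate case.

```python
def two_variate_regression(x1, x2, y):
    """Fit Y = y_bar + b1 (x1 - x1_bar) + b2 (x2 - x2_bar) via the inverse matrix."""
    n = len(y)
    m1, m2, my = sum(x1) / n, sum(x2) / n, sum(y) / n
    d1 = [v - m1 for v in x1]
    d2 = [v - m2 for v in x2]
    dy = [v - my for v in y]
    S11 = sum(a * a for a in d1)
    S22 = sum(a * a for a in d2)
    S12 = sum(a * b for a, b in zip(d1, d2))
    S1y = sum(a * b for a, b in zip(d1, dy))
    S2y = sum(a * b for a, b in zip(d2, dy))
    Syy = sum(a * a for a in dy)

    # Inverse of the symmetric 2 x 2 matrix of the normal equations.
    det = S11 * S22 - S12 ** 2
    c11, c12, c22 = S22 / det, -S12 / det, S11 / det

    b1 = c11 * S1y + c12 * S2y          # formula 9.9.c
    b2 = c12 * S1y + c22 * S2y
    Q = Syy - b1 * S1y - b2 * S2y       # formula 9.9.d
    s2 = Q / (n - 3)                    # residual variance on n - 3 d.f.
    return {"b1": b1, "b2": b2, "c": (c11, c12, c22), "Q": Q, "s2": s2}


def variance_of_linear_function(fit, l1, l2):
    """V(l1 b1 + l2 b2) from formula 9.9.e."""
    c11, c12, c22 = fit["c"]
    return (l1 * l1 * c11 + 2 * l1 * l2 * c12 + l2 * l2 * c22) * fit["s2"]


# Hypothetical observations.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
x2 = [2.0, 1.0, 4.0, 3.0, 6.0, 5.0]
y = [9.1, 7.9, 18.9, 18.1, 29.1, 27.9]

fit = two_variate_regression(x1, x2, y)
print(fit["b1"], fit["b2"], fit["s2"])
print(variance_of_linear_function(fit, 1.0, 0.0))   # V(b1)
print(variance_of_linear_function(fit, 1.0, -1.0))  # V(b1 - b2)
```

With more than two variates the explicit determinant is no longer convenient and one of the systematic inversion methods referred to above would be used instead.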
The above formulae can be adapted to the fitting of curvilinear regression
lines. If, for example, we require to fit a quadratic

$$y = a + bx + cx^2$$

a multiple regression on x and $x^2$ can be calculated. Since x and $x^2$ are highly
correlated if all x are of the same sign it is better for purposes of computation
to take $(x - x_0)^2$ for the second variate, $x_0$ being a convenient value of x near $\bar{x}$.
If the values of x are equally spaced, curvilinear regressions can be
conveniently fitted by the use of orthogonal polynomials. The necessary tables
and instructions for their use will be found in Statistical Tables. Their main
use in survey work is in the analysis of data from surveys carried out on successive
occasions, with time as the independent variate.


9.10 Regression: effect of random errors in the variates 


Random errors in the dependent variate make the regression coefficients
less precise, but no consistent error is introduced: as the sample size is increased
the coefficients tend to the underlying population values. If, on the other hand,
there are random errors in the independent variates the estimated coefficients,
regarded as coefficients of an underlying regression law, are subject to consistent
errors which do not decrease as the size of the sample is increased. In the case
of a single independent variate with errors of x uncorrelated with those of y,
a consistent estimate b′ of the coefficient $\beta$ of the underlying regression law is

$$b' = b/(1 - h) \qquad (9.10.a)$$

where h is the ratio of the error variance of x to the total variance of x (including
the error variance). Thus the regression coefficient b calculated in the ordinary
manner is on the average too small in absolute magnitude, and is said to be
attenuated.

In the case of a multiple regression with independent errors in the different
x’s the estimation of the coefficients of the underlying regression law will be
obtained by replacing $S_{11}$, $S_{22}$, etc., in the normal equations by $S_{11}(1 - h_1)$,
$S_{22}(1 - h_2)$, etc. If the errors of the x’s are correlated similar corrections are
required for $S_{12}$, etc. If the errors of the x’s are correlated with those of y
the terms $S_{1y}$, etc., must also be corrected.

These adjustments can only be made if the relevant error variances and 
covariances can be estimated. If they arise solely from sampling errors this 
will often be possible. Unfortunately errors in the independent variates are 
not confined to sampling errors. Errors of observation and measurement, and 
failure of the chosen measures to represent what is really required, will produce 
similar disturbances. Errors of observation and measurement frequently 
require supplementary investigations if they are to be assessed, though in some 
cases, as in the results described below, they will be included in the sampling 
errors as ordinarily calculated. Failure of the chosen measures to represent 
what is really required is much more troublesome and the amount of the 
disturbance cannot ordinarily be assessed. (An example where this point may 
arise is considered in the next section.) 

An example of a case in which adjustments for attenuation were necessary
is provided by a survey recently carried out in England and Wales on potatoes.
In this survey the yields were estimated by taking a sample of about 35 fields
per county and lifting and weighing small sample lengths of row from the selected
fields (Boyd and Dyke, 1950, H’). The means of the sample yields and the official
estimates (tons per acre) for the surveyed counties for the three years are shown
in Table 9.10, and Fig. 9.10 shows the Ministry of Agriculture’s estimates
of the yields of the surveyed counties plotted against the sample estimates.
The top broken line is the line on which all points should fall if there were no
errors in either the official or the sample estimates. The sample estimates may
be taken to be virtually free from bias, but they are subject to random sampling
errors owing to selection of fields within a county, and to the sampling of the
selected fields.

TABLE 9.10. YIELDS (TONS PER ACRE) AND REGRESSION COEFFICIENTS IN THE
POTATO SURVEY

                                                   Regression coefficients
Year    Sample yields (x)   Official estimates (y)   Unadjusted   Adjusted
1948          9.35                  7.69                0.365       0.457
1949          7.52                  6.30                0.520       0.606
1950          9.48                  7.59                0.415       0.530

FIG. 9.10. POTATO SURVEY: THE RELATION BETWEEN OFFICIAL ESTIMATES AND
SAMPLE YIELDS OF COUNTIES FOR 1948, 1949 AND 1950 (TONS PER ACRE)
[Figure not reproduced: official estimates (y) plotted against sample yields (x).]

It appears from the figure that there is a tendency to underestimate counties
with high yields. This tendency can be evaluated quantitatively by calculating
the regressions of the official estimates on the sample estimates. The thin
lines on the diagram give these regressions for the three years. The coefficients
are given in Table 9.10. Since, however, the sample estimates are subject
to random errors these regressions require adjustment if they are to represent
the average values of the official estimates for given values of the true county
yields.
The thick lines on the diagram and the coefficients in Table 9.10 give these
adjusted regressions. Apart from random errors of estimation they represent
the lines that would have been obtained if the yields of all the fields in each
county had been determined without error. The calculated regression line for
1948, for example, was

$$y = 7.69 + 0.365 (x - 9.35) \qquad (9.10.b)$$

The average error variance per county was 0.398. This includes both the
first-stage component due to selection of farms, the second-stage component due
to the selection (where necessary) of fields on the selected farms, and the
third-stage component due to the sampling of the selected fields. It is calculated from
the within-counties variance of the mean sample yield per farm. The total
variance of the county sample estimates was 1.968. Hence h = 0.398/1.968 =
0.202, and b′ = 0.365/(1 − 0.202) = 0.457. The adjusted regression line is
therefore

$$y = 7.69 + 0.457 (x - 9.35) \qquad (9.10.c)$$

It will be seen that part, but by no means the whole, of the apparent under- 
estimation of high yields can be attributed to random errors in the sample 
estimates. The lines for the three years have also been brought into somewhat 
closer agreement by the adjustments, for although the adjustments to the three 
regression coefficients are very similar, the lower mean yield of 1949 has raised 
this line relative to the others. 
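The adjustment for 1948 can be retraced with formula 9.10.a. The sketch below uses only the figures quoted above; the function names are illustrative. Small differences in the third decimal place may arise according to whether h is rounded before use.

```python
def attenuation_ratio(error_variance, total_variance):
    """h: ratio of the error variance of x to the total variance of x."""
    return error_variance / total_variance


def adjusted_coefficient(b, h):
    """Consistent estimate b' = b / (1 - h) of the underlying coefficient."""
    if not 0.0 <= h < 1.0:
        raise ValueError("h must lie between 0 and 1")
    return b / (1.0 - h)


# 1948 figures from the potato survey.
h_1948 = attenuation_ratio(0.398, 1.968)
b_adjusted_1948 = adjusted_coefficient(0.365, h_1948)


def adjusted_line_1948(x):
    """Adjusted regression line through the 1948 means (9.35, 7.69)."""
    return 7.69 + b_adjusted_1948 * (x - 9.35)


print(h_1948, b_adjusted_1948, adjusted_line_1948(10.0))
```

The line still passes through the point of means; only its slope is altered by the adjustment.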

In 1948 the estimates were provided by the Ministry’s Crop Reporters, 
but in 1949 and 1950 the duty of making estimates was transferred to the 
National Agricultural Advisory Service. The close agreement of the adjusted 
lines—the differences are no greater than would be expected from random 
errors—demonstrates the very similar behaviour of the two different groups of 
reporters. The line obtained by taking a weighted mean of these lines (by a 
procedure we need not describe) is 

$$Y = 7.19 + 0.563 (x - 8.78) \qquad (9.10.d)$$

These results may be used to establish a formula for correcting official
estimates in future years. This is not so simple a problem as it appears at
first sight. If both official and sample estimates were available for all counties
over a number of years, the regression of the sample yearly means $\bar{x}_t$ on the
official yearly means $\bar{y}_t$ would provide the appropriate equation of estimation,
at least in so far as future years could be regarded as a random sample from the
same population as the years in which the samples were taken. No adjustment
for errors in x would be required, since $\bar{x}_t$ is here the dependent variate. The
present results, however, only provide data for a selection of counties for three
years, and the regression of $\bar{x}_t$ on $\bar{y}_t$ will therefore be too ill-determined to be
of any value. Consequently we have to consider whether it is possible to
establish this regression line indirectly.

In the light of the results obtained we may tentatively assume :— 

(1) The official county estimates are distributed about the mean adjusted 
regression line of y on x given by equation 9.10.d with a residual variance 
estimated at 0.624.*

(2) There may be an additional common component of error affecting all 
the official estimates of a particular year, but in view of the closeness of the 
adjusted regression lines of y on x for the three years this component is likely 
to be small. 

We shall also assume, in order to simplify the discussion, that the errors 
of a particular county about the regression line are independent from year to 
year. This is not likely to be wholly correct, since official estimates for a 
particular county are for the most part made by the same reporters in successive 
years. 

If $\bar{x}_t'$ is the true mean yield of all counties for the year t, and if the common
component of error under assumption (2) is negligible, the points $(\bar{x}_t', \bar{y}_t)$
representing the yearly means will deviate from the regression line only by
an amount due to the random errors arising from assumption (1). The official
estimate for the country is a weighted mean of the county estimates, with
weights w proportional to the county potato acreages. Using the county acreages
for 1942 (which has a similar total potato acreage to 1950) we find

$$S(w^2)/\{S(w)\}^2 = 0.0302$$

and consequently from formula 7.5.e the variance of $\bar{y}_t$ about the regression
line is given by

$$V_r(\bar{y}_t) = 0.0302 \times 0.624 = 0.0188.$$

We also require estimates of the mean and variance of the yearly means
of either the true yields or the official estimates. No reliable estimate can be
obtained from the present data, since only three years are available, but one
can be obtained from the values of the official estimates over past years. Taking
the last 20 years for which data are readily available (1930-1949) we obtain

$$\bar{y} = 6.80, \qquad V(\bar{y}_t) = 0.333.$$

From the adjusted mean regression line (9.10.d) the corresponding mean $\bar{x}'$
for the true yields is 8.09.

If $Y_t'$ represents the value of y for the point on the regression line
corresponding to $\bar{x}_t'$ we have

$$V(\bar{y}_t) = V(Y_t') + V_r(\bar{y}_t) = b^2 V(\bar{x}_t') + V_r(\bar{y}_t)$$

and hence

$$\mathrm{cov}(\bar{x}_t', \bar{y}_t) = b V(\bar{x}_t') = \frac{1}{b} \{V(\bar{y}_t) - V_r(\bar{y}_t)\} = \frac{1}{b} V(\bar{y}_t)(1 - k)$$

where $k = V_r(\bar{y}_t)/V(\bar{y}_t) = 0.0188/0.333 = 0.0565$.
The regression coefficient of $\bar{x}_t'$ on $\bar{y}_t$ (referred to the y axis) is therefore

$$\frac{\mathrm{cov}(\bar{x}_t', \bar{y}_t)}{V(\bar{y}_t)} = \frac{1 - k}{b} = 1.676$$

The regression equation passing through the point $(\bar{x}', \bar{y})$ will be

$$\bar{x}_t' = 8.09 + 1.676 (\bar{y}_t - 6.80) = 8.43 + 1.676 (\bar{y}_t - 7.00) \qquad (9.10.e)$$

This is not shown in Fig. 9.10 but is almost the same as the 1949 adjusted
regression.

* The residual variance 0.624 is calculated in the ordinary manner except that
b is replaced by b′ in the formula for Q (Section 7.12).
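The steps leading to equation 9.10.e can be retraced numerically. The Python sketch below takes every input from the text (the slope 0.563 of equation 9.10.d, the residual variance 0.624, the weighting factor 0.0302, and the 20-year mean and variance of the official estimates); the variable names are illustrative.

```python
# Inputs quoted in the text.
b = 0.563                 # slope of the mean adjusted regression of y on x (9.10.d)
residual_variance = 0.624
weight_factor = 0.0302    # S(w^2) / {S(w)}^2 for the county acreages
var_official_means = 0.333
mean_official = 6.80
mean_true = 8.09          # mean of the true yields corresponding to 6.80

# Variance of a yearly official mean about the regression line.
v_r = weight_factor * residual_variance

# Fraction of the variance of the yearly means due to these random errors.
k = v_r / var_official_means

# Regression coefficient of the true yearly mean on the official yearly mean.
slope = (1.0 - k) / b


def predict_true_mean(official_mean):
    """Estimated true mean yield for a given official yearly mean."""
    return mean_true + slope * (official_mean - mean_official)


print(round(v_r, 4), round(k, 4), round(slope, 3), round(predict_true_mean(7.00), 2))
```

Since k is small, the slope is only a little less than 1/b, the value that would hold if the yearly official means lay exactly on the adjusted regression line.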

If we had taken the regression of all the observed x on the observed y
(disregarding the year classification) we should have obtained the equation
(line F in the figure)

$$\bar{x}_t' = 8.88 + 0.966 (\bar{y}_t - 7.27) = 8.62 + 0.966 (\bar{y}_t - 7.00) \qquad (9.10.f)$$

This differs considerably in slope from the regression given above. If only
the data for the years 1948 and 1950 had been available the difference would
have been much greater, the line (line G in the figure) being

$$\bar{x}_t' = 9.41 + 0.593 (\bar{y}_t - 7.65) = 9.02 + 0.593 (\bar{y}_t - 7.00) \qquad (9.10.g)$$

whereas the procedure giving equation 9.10.e would have given the line

$$\bar{x}_t' = 7.64 + 1.953 (\bar{y}_t - 6.80) = 8.03 + 1.953 (\bar{y}_t - 7.00) \qquad (9.10.h)$$

This line is also not shown in the figure but is nearly the same as the 1950
adjusted regression. The procedure therefore gives a relatively stable line.

If there is an additional common component of error under assumption (2),
the slope of the regression equation (relative to the y-axis) should be decreased,
but since we have no means of estimating this component the amount of the
decrease cannot be assessed. We might, however, adopt the simple compromise
of taking the regression of $\bar{x}_t$ on $\bar{y}_t$ passing through the origin. For this
regression

$$b = S(\bar{x}_t \bar{y}_t)/S(\bar{y}_t^2) = 1.222$$

Hence the line (line J in the figure) is

$$\bar{x}_t' = 1.222\,\bar{y}_t = 8.55 + 1.222 (\bar{y}_t - 7.00) \qquad (9.10.i)$$

Apart from any additional common component of error, the success of equation
9.10.e in future years will depend on how far the reporters continue to make the
same type of error as they have in the past. If they become aware of their present
tendency to underestimate high yields they may endeavour to improve their
estimates. This will, of course, vitiate any adjustments based on equation
9.10.e.

The reader will find it instructive to calculate the predicted values of $\bar{x}_t'$
for the three years for which data are available.
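As a starting point for this exercise, the sketch below computes the slope of the regression through the origin from the yearly means of Table 9.10 and sets the predictions of equations 9.10.e and 9.10.i beside the observed sample means. It assumes that the yearly means in the table are the values entering the sums; the function names are illustrative.

```python
# Yearly means from Table 9.10 (tons per acre).
sample_means = {1948: 9.35, 1949: 7.52, 1950: 9.48}     # x
official_means = {1948: 7.69, 1949: 6.30, 1950: 7.59}   # y


def slope_through_origin(x_by_year, y_by_year):
    """b = S(x y) / S(y^2) for the regression of x on y through the origin."""
    s_xy = sum(x_by_year[t] * y_by_year[t] for t in x_by_year)
    s_yy = sum(y_by_year[t] ** 2 for t in x_by_year)
    return s_xy / s_yy


def predict_9_10_e(official_mean):
    """Equation 9.10.e with the coefficients quoted in the text."""
    return 8.43 + 1.676 * (official_mean - 7.00)


b_origin = slope_through_origin(sample_means, official_means)

predictions = {
    year: (predict_9_10_e(y), b_origin * y)
    for year, y in official_means.items()
}

for year in sorted(predictions):
    by_e, by_i = predictions[year]
    print(year, round(by_e, 2), round(by_i, 2), sample_means[year])
```

Comparing the two columns of predictions with the observed sample means shows how much the choice between the steeper line 9.10.e and the compromise line 9.10.i matters within the range of yields actually encountered.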


9.11 The interpretation of multiple regression 


The interpretation of the results of multiple regression analysis requires 
the greatest care. Nothing is easier than to reach false conclusions. The first 
point to remember is that all regression and correlation analysis merely deals 
with associations. By itself it tells us little of the causative factors that are 
operating. Fortunately we are frequently in a position to make at least tentative 
assumptions about the actual causative system. When this is possible a regres- 
sion analysis can, under favourable circumstances, confirm or disprove our 
assumptions, and provide estimates in quantitative terms of the effects of the 
different factors. 

As a specific example of the types of problem involved we may take the case 
of a survey of housing conditions conducted with the object of finding the 
influence of such conditions on the health of the occupants. It was observed 
by M‘Gonigle and Kirby (1936) that the health of the inhabitants of “ an 
unhealthy area” of Stockton-on-Tees deteriorated when they were rehoused 
in a self-contained municipal housing estate, owing to the fact that families 
moving to better houses had to spend a greater part of their total income on 
rent, etc., to the detriment of their general living standards, and particularly 
of their nutrition. On the other hand, in an investigation in Newcastle-upon- 
Tyne during the depression, which revealed the alarming difference in health 
between children of working-class (largely unemployed) parents and those of 
middle-class parents, Dr. J. C. Spence (reported by M‘Gonigle and Kirby) 
came to the conclusion that the main factors responsible for the difference 
were 

(a) The housing conditions, which permit mass-infections of young 
children at susceptible ages. 
(b) Improper and inadequate diet, which prevents satisfactory recovery 
from their illnesses. 
He further states : “ It is probable that these two factors are of equal importance; 
but I would suggest that opinion on this matter should be reserved until a 
full inquiry, carried out by competent observers in a scientific manner, has 
studied the problem more closely.” 

We will consider how this situation is likely to be reflected in the results 
of a survey of a group of the population subject to different housing conditions 
but otherwise relatively homogeneous. For purposes of discussion we will
assume that information is available on income as well as on housing conditions
and health, but that no information is available on standard of nutrition, etc. 
It is reasonable to suppose that income affects housing conditions in that those 
with larger incomes will be in a position to obtain better housing, but that 
housing conditions do not exert any appreciable influence on income. Both 
income and housing conditions may be expected to affect health, income 
operating through housing conditions which are observed, and through other 
factors, such as nutrition, which are not observed. If U represents total income, 
V housing conditions and Y health this causative system can be represented 
by the following diagram : 


        U
        |  \
        |    V
        v  /
        Y

(arrows run from U to V, from U to Y, and from V to Y)

The arrow between U and Y here represents the “net” effect of income on
health, i.e. the effect of income other than that due to change in housing
conditions.

This leads to the concept of net income, i.e. income after deduction of rent 
and other charges associated with a given type of housing. This is the part 
of the income which is available to produce the net effect of income on health. 

To simplify the discussion consider only families of a given size and
composition. Take u to represent the total income of such a family, $u_n$ the net
income, v the index of housing conditions, and y the health index. If, within
the ranges covered by the variates, the causative relations are linear with a
superimposed random component, the equations representing the above causative
system may be written

$$v - v_0 = \gamma (u - u_0) + e_1 \qquad (9.11.a)$$
$$y - y_0 = \alpha (u_n - u_{n0}) + \beta (v - v_0) + e_2 \qquad (9.11.b)$$

where $e_1$ and $e_2$ are random components, the Greek letters are numerical
coefficients, and $u_0$, $u_{n0}$, $v_0$, $y_0$ represent a set of values of u, $u_n$, v, y near their
means which conform to the linear relationship defining the causative system.
The coefficient $\alpha$ represents the average increase in health index that may
be expected if incomes are raised one unit but people are prevented from
spending any of this additional income on improvement of housing conditions.
Similarly the coefficient $\beta$ represents the average increase in health index that
would result from an improvement of housing conditions if this improvement
entailed no additional charges either direct or indirect on the occupier. The
coefficient $\gamma$ represents the average increase in the housing-condition index
that may be expected to result from unit increase in income.
If information has been collected for the individual families on all the
necessary points, the net income of each family can be calculated directly.
In this case this variate should be used. Here, however, we wish to consider
the case in which detailed data of this type are not available, but sufficient
information is available to make an estimate of the average charge k on the
total income resulting from a unit increase in the housing-condition index.
We then have

$$u_n - u_{n0} = (u - u_0) - k (v - v_0) + e_3 \qquad (9.11.c)$$

The equation for y may now be written

$$y - y_0 = \alpha (u - u_0) + (\beta - k\alpha)(v - v_0) + e_2 + \alpha e_3 \qquad (9.11.d)$$


The basic data will be in the form of observations on the three variates, 
u, v and y. For purposes of analysis such data may well be condensed by 
grouping over income classes and over housing conditions so as to form a 
two-way table with income and housing condition as the two classifications, 
the entries in the table being the mean health indexes for all the families 
belonging to that cell. An auxiliary table giving the number of families in 
each cell will also be required. 

If, now, we calculate the multiple regression of y on u and v we shall have
an estimate of the constants of equation 9.11.d. If this regression is
$$y = \bar{y} + a (u - \bar{u}) + b (v - \bar{v}) \tag{9.11.e}$$
a provides an estimate of $\alpha$ and b of $\beta - k\alpha$. Consequently the direct effect of
housing $\beta$ is estimated by b + ka.

If also the regression of v on u is calculated, and found to be
$$v = \bar{v} + c (u - \bar{u}) \tag{9.11.f}$$
c provides an estimate of $\gamma$.

The total regression of y on u,
$$y = \bar{y} + a' (u - \bar{u}) \tag{9.11.g}$$
is also of interest, as will be explained later.

When this causative system is operating, therefore, the partial regression
coefficient b of health on housing conditions, with total income as the second
independent variate, does not estimate the direct effect of change of housing
conditions on health. It represents the net effect, which is the difference between
this direct effect and the effect of the reduction of other aspects of the standard
of living due to having to spend more on housing. If there is compulsory
improvement of housing, by slum clearance schemes and the like, of amount
$\delta v$, without improvement of income either direct or indirect (e.g. by rent
subsidies), an improvement of health of $b\,\delta v$ may be expected. On the other
hand, if the full additional cost of the housing to the occupiers is covered
by subsidies or other means, an improvement of $(b + ka)\,\delta v$ may be expected.

If incomes are raised by an amount $\delta u$ and the situation is such that housing
conditions and rents cannot change, an improvement in health of $a\,\delta u$ may be
expected. If, however, the increased incomes are allowed to produce their
natural effect in improving housing conditions, this improvement may from
equation 9.11.f be expected to amount to $c\,\delta u$. The expected improvement
in health, using equation 9.11.d, will then be
$$
\begin{aligned}
a(\delta u - kc\,\delta u) + (b + ka)\,c\,\delta u &= (a + bc)\,\delta u \\
&= \left\{ a + b\,\frac{S(u - \bar{u})(v - \bar{v})}{S(u - \bar{u})^2} \right\} \delta u \\
&= \frac{a\,S(u - \bar{u})^2 + b\,S(u - \bar{u})(v - \bar{v})}{S(u - \bar{u})^2}\,\delta u \\
&= \frac{S(u - \bar{u})(y - \bar{y})}{S(u - \bar{u})^2}\,\delta u = a'\,\delta u
\end{aligned}
$$
the last line being derived from equation 9.9.b. The required increase is
therefore given by the total regression of y on u, as might be expected.
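
The identity a' = a + bc can be checked numerically. The following is a minimal sketch only: the six (income, housing, health) triples are invented for illustration, and the function names `corrected_sum` and `fit_regressions` are hypothetical. It solves the two normal equations for the partial coefficients a and b, computes c and a' as simple regression coefficients, and compares a + bc with a'.

```python
def corrected_sum(xs, ys):
    """Corrected sum of products S(x - mean x)(y - mean y)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))


def fit_regressions(u, v, y):
    """Return (a, b, c, a_total) for regressions 9.11.e, 9.11.f and 9.11.g."""
    s_uu = corrected_sum(u, u)
    s_vv = corrected_sum(v, v)
    s_uv = corrected_sum(u, v)
    s_uy = corrected_sum(u, y)
    s_vy = corrected_sum(v, y)
    # Solve the two normal equations of the multiple regression of y on u and v.
    det = s_uu * s_vv - s_uv ** 2
    a = (s_uy * s_vv - s_vy * s_uv) / det
    b = (s_vy * s_uu - s_uy * s_uv) / det
    c = s_uv / s_uu        # regression of v on u
    a_total = s_uy / s_uu  # total regression of y on u
    return a, b, c, a_total


# Invented illustrative data (not taken from any survey).
income = [10.0, 12.0, 15.0, 18.0, 20.0, 25.0]
housing = [2.0, 3.0, 3.0, 5.0, 4.0, 7.0]
health = [50.0, 54.0, 55.0, 63.0, 60.0, 72.0]

a, b, c, a_total = fit_regressions(income, housing, health)
print("a + bc =", a + b * c, " a' =", a_total)
```

Given an assumed value of k, the direct effect of housing would then be estimated as `b + k * a`, as in the text.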

The above interpretation can be accepted without qualification only if the 
causative system really conforms to the postulated model. In practice there are 
likely to be departures from such a simple model, many of which will introduce 
serious disturbances which may entirely vitiate the conclusions. 

In the first place there will usually be external causative agents, not included 
in our regression system, which affect the various variates. In the above example, 
for instance, the level of the education of the adult members of the household 
may be expected to affect income. It may also, to a less extent, affect housing 
conditions (apart from influence due to income). Provided education does not 
affect health directly, these influences will not disturb the partial regression 
coefficients of y on u and v or their interpretation, but they will affect the
regression coefficient of v on u. This latter will now represent the sum of the 
effects of a direct increase in income and of the associated average increase in 
educational level. The coefficient will therefore no longer give an estimate of 
the increase in housing conditions that may be expected from a rise in income 
level of an individual whose educational level remains unchanged. 

If, on the other hand, educational level affects health directly, as may well
be the case, the partial regression coefficients of health on income and housing 
conditions will be similarly disturbed. 

These disturbances can theoretically be eliminated and the effects of edu- 
cation measured if an appropriate measure of education is available. All that 
is necessary is to include a term for education in the regression system. In 
practice, however, it is not possible to eliminate all disturbances of this kind, 
because of the number of variates that may be involved, because some of them 
may not be measured or may be unmeasurable, and because correlations between 
them prevent the separation of their effects. 

A further complication which affects interpretation of regression equations 
arises when there is a two-way causal relationship. In the example considered, 
ill-health, if continued over any long period of time, may undoubtedly be 
expected to depress income. Consequently the association between income 
and health arises not only from the influence of income on health but also from 
the influence of health on income. The only way of attempting to assess the
magnitude of disturbances of this kind and to eliminate their effects is by more
detailed observations of different parts of the material. If, for example, the 
health index is calculated from the health of the occupants of the house other 
than the wage-earner it may be hoped that the components of this index which 
affect income will be much reduced. 

Errors in the independent variates can also affect the results. In so far 
as these are random and their variances and covariances are known they can 
be allowed for by the method of Section 9.10. But this will not be the case 
in the problem we are considering. There will in fact be many aspects of 
housing which may affect health, often in different ways. These may be 
inadequately recorded (and indeed exact assessment of some aspects may be 
almost impossible) or the index may be imperfectly chosen.* Our regression 
will then not measure the full effect of housing conditions, but only the effect 
of such conditions as are correctly summarised in the index. 

This complicates the issue in another way. If income is accurately measured 
but housing conditions are inaccurately measured and the causative system 
shown above is operating, some of the effect that should be attributed to housing 
conditions will appear as a direct income effect. In the extreme case, for 
example, where the chosen housing index bears no relation to the housing
conditions which affect health, and is uncorrelated with income, the partial 
regression coefficient of y on v will be zero, and that of y on u will be equal 
(except for random errors) to the total regression coefficient of y on u. 

When the chosen index measures some aspect of housing which is closely 
correlated with income but which does not affect health, the total effect of 
income will be divided between the partial coefficients of y on u and v in a
manner which depends to a large extent on the chance distribution of error. 

From the above discussion it can be seen that the use of the regression 
method in the interpretation of survey data is fraught with hazards. In part 
these hazards arise from the fundamental weaknesses of observational material 
stressed at the beginning of Section 5.23, but in part they can be attributed 
to an over-simplified approach to the problem. Housing conditions can vary 
in many ways, and ill-health can take many forms. The more precise and 
detailed the observations, the more relevant the quantities observed to the 
causal systems believed to be operating, and the greater our knowledge of the 
causal systems themselves, the more confidence can we have in our conclusions. 
Thus, any detailed analysis of the effect of general housing conditions on 
health generally must be extremely tentative, and may well, in the light of the 
above discussion, be judged to be not worth while. On the other hand, if
we are dealing with a specific disease, such as dysentery, known to be spread
under insanitary conditions, and if we can get a direct measure of these
insanitary conditions and can show that the incidence of dysentery is closely
related to them, then we shall feel ourselves on much safer ground in drawing
the conclusion that if these conditions are remedied we may expect to see a
considerable fall in the incidence of dysentery.

* Given ample data, statistical procedures are available for choosing the best
index. In fact, all that is necessary is to include the different components
of the index as separate terms in the regression equation, but the additional
computational work involved in such a procedure will not ordinarily be justified.

If, at the same time, our investigation is extended to other diseases, and 
these are shown to be related to the types of condition that are known to favour 
them, our confidence in the validity of all our conclusions will thereby be 
strengthened. Measuring a single association and drawing an isolated con- 
clusion from it does not, in fact, constitute good investigational work. Such 
work is much more of the nature of a detective inquiry, in which all the separate 
pieces of evidence are assembled and fitted together. If and only if they form 
a coherent picture are we entitled to have confidence in our conclusions. 
Statistically, therefore, such work is much more difficult, and requires much 
more critical ability, than the analysis of experimental data, where the effects 
of separate factors are deliberately isolated in the planning of the experiment. 

In the above discussion we have considered the simple case of a regression 
analysis with two independent variates. If the data are extensive the regression 
of health on housing conditions can be calculated separately for each income 
level. The partial regression coefficient b will be a weighted average of these 
regression coefficients. Similarly the partial regression coefficient a is a 
weighted average of the coefficients of the regression of health on income for 
different levels of housing conditions. The multiple regression method, 
therefore, provides an automatic averaging of the separate regression lines for 
different parts of the data. If examination of the data indicates that there are 
real differences of a meaningful nature in the separate regression lines, such 
averaging will be inappropriate. 

We have thought it worth while to consider a specific case of regression 
analysis in some detail because of the very real dangers of misinterpretation 
in investigational work. We have taken the case where the underlying causal
relationship may be considered to be of bivariate linear form. The same 
considerations and qualifications apply when some or all of the variates are 
qualitative. Indeed the analysis of such data by the method of fitting constants 
(exemplified in Section 5.24) is very similar in principle to multiple regression 
analysis. If the data, whether quantitative or qualitative, are extensive the 
necessary examination can frequently be made by comparisons of the means 
of various groups instead of formally calculating regressions or fitting constants 
(see Sections 9.6 and 9.8). Whatever the method employed, however, the 
interpretation of the results is governed by the same principles.


MISCELLANEOUS DEVELOPMENTS


10.1 Machine developments 

There is not space here to give an account of all the most recent developments 
in tabulating machinery. Improvements and modifications are continually 
being introduced by the various machine companies. Moreover the development 
of the machines of the different companies of the Hollerith group is following 
somewhat different lines. The description given in Sections 5.11-5.16 refers
to British machines and is not applicable in detail to the machines of other
companies. The companies should therefore be consulted when any large
tabulating job is to be undertaken, particularly if the installation of new
equipment is contemplated.

There are, however, two developments in punched card machinery which
have proved of particular value in census work, and which may therefore be
mentioned here. The first, mark sensing, was briefly referred to in Section 5.9.
The view there expressed that it was not likely to be of much value in census 
work has proved to be incorrect. It is, for example, successfully used in the 
Canadian Labour Force Census which is carried out on a sample at intervals 
of three months (Keyfitz and Robinson, 1949, F’). In ordinary mark sensing 
specially printed Hollerith cards are used. For each column to be punched a 
mark is made in one or more of twelve positions, and the cards are subsequently 
passed through a machine which reads these marks and punches the information 
on the same card. In the Canadian Labour Force Census special record cards 
are used, both sides being marked by the field officer. The only coding done in
the office, which is also recorded by mark sensing on the original card, is the
classification of the worker by industry and occupation. The completed cards
are then passed through a special machine which reads the recorded information
and punches it on an ordinary Hollerith card. Checks for inconsistencies are
made at the time of punching by mechanism of the type described in the next
paragraph incorporated in the machine. It is found that the use of mark sensing,
amongst other advantages, results in a clear saving of two weeks in the processing
of the data.

The other development which is of interest is the introduction of techniques
and equipment for more effective checking for inconsistencies (mechanical
editing) referred to in Section 5.20. Machines are now available which will
check for inconsistencies (including inconsistent relationships between different
fields) on any of the 80 columns of the card. Depending on the
complexities of the inter-relationships being checked, as many as
twenty-five columns can sometimes be simultaneously examined. Cards pass through
these machines at a rate of 450 per minute and those which fail the consistency
checks are automatically separated from the others. Machines for tabulations
of the census type are also continually being improved and elaborated.

It may also be mentioned here that the collator (Section 5.16) can be fitted
with a card counting device. Amongst other uses this enables every gth card 
of a pack to be picked out (g being an integer not greater than 99). This can 
be of value in extensive systematic sampling from a pack of cards. 
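
Picking out every gth card amounts to a systematic selection. A minimal sketch of the idea follows, assuming the pack is held as a Python list and that the first card is chosen by a random start between 1 and g (the random start is an assumption of this sketch; the collator itself merely counts). The function name `every_gth_card` is hypothetical.

```python
import random


def every_gth_card(pack, g, start=None):
    """Return every g-th card of the pack, beginning at position `start` (1-based).

    If no start is supplied, a random start between 1 and g is drawn.
    """
    if g < 1:
        raise ValueError("g must be a positive integer")
    if start is None:
        start = random.randint(1, g)
    if not 1 <= start <= g:
        raise ValueError("start must lie between 1 and g")
    # Slice from the (start - 1)-th index in steps of g.
    return pack[start - 1::g]


# A pack of 2,000 cards numbered 1..2000, taking every 50th card from card 7.
pack = list(range(1, 2001))
picked = every_gth_card(pack, 50, start=7)
print(len(picked), picked[:3])
```

With 2,000 cards and g = 50 the selection always contains 40 cards, whatever the start.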

In addition to developments in punched card machinery (which themselves 
include the use of electronic devices) an entirely new field is being opened up 
by the development of high-speed electronic and relay calculators. So far, 
most of these machines have been built mainly for the purpose of undertaking 
the type of calculation required in the mathematical and physical sciences, 
but in the near future they may well come to play an important part in 
statistical work. UNIVAC was specifically built for the purpose of analysing 
the U.S. Census of Population. It recently demonstrated its ability in another 
statistical field by predicting the results of the 1952 U.S. Presidential Election 
with high accuracy from the first few polling returns that became available—a 
result that was so unexpected that its publication was withheld!


10.2 Methods of taking a stratified sample from a list or card index 


Although there is no theoretical difficulty in taking a stratified sample 
from a list or card index, the practical difficulties are often considerable if the 
register is at all large. In this section various alternative procedures are 
discussed.

As an example we will consider alternative methods of taking a sample of 
vehicles from a register of 100,000 vehicles which can be stratified into 6 strata 
with numbers approximately as in Table 10.2. 


TABLE 10.2—NUMBER OF VEHICLES IN A REGISTER 


Type of operation      0-10 cwt.       11-20 cwt.      Over 20 cwt.
For hire               20,000 (50)     20,000 (200)    8,000 (160)
On own account         40,000 (100)    10,000 (100)    2,000 (40)
Sampling fraction      1/400           1/100           1/50


It is desired to take a stratified sample with a variable sampling fraction 
having the values shown. The numbers in brackets give the numbers in the 
sample (total, 650). 


(1) Cards arranged (or easily arrangeable) in the required strata 


In this case, every qth card of a stratum will be taken, 1/q being the required 
sampling fraction. It is best to start with a random number between 1 and q. 
Thus for the stratum of over 20 cwt. on own account we should take every 
50th card out of the 2000, starting with a random number between 1 and 50.

(2) Numbers in strata known, but cards not easily arranged in strata

A preliminary sample of every qth card can be taken, where 1/q is somewhat
larger than the largest sampling fraction $f_0$. A reasonable rule for determining
q is as follows:

If the number required in the stratum with the largest sampling fraction
(or the smallest of these strata if there is more than one) is $n_0$, and the total
number in the stratum is $N_0$, q should be chosen as the next convenient
whole number less than (or equal to) q′, where
$$\frac{1}{q'} = \frac{(n_0 + 1) + \sqrt{2 n_0 + 1}}{N_0}.$$
There is then a chance of about 9/10 of getting the required number or
more in this stratum in the preliminary sample.

In our example $N_0 = 2000$, $n_0 = 40$. Hence q′ = 40 and consequently
q = 40.
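
The rule can be evaluated mechanically. The sketch below reads the rule as q′ = N0 / {(n0 + 1) + √(2 n0 + 1)}, a reading which reproduces the value q′ = 40 of the example; taking the "next convenient whole number" to be simply the integer part is an assumption of the sketch, and the function name `preliminary_interval` is hypothetical.

```python
import math


def preliminary_interval(stratum_size, required):
    """Return (q_prime, q) for the preliminary sample of every q-th card.

    stratum_size: total number N0 in the stratum with the largest fraction.
    required:     number n0 required from that stratum in the final sample.
    """
    q_prime = stratum_size / ((required + 1) + math.sqrt(2 * required + 1))
    # Integer part of q_prime; the small tolerance guards against rounding error.
    q = math.floor(q_prime + 1e-9)
    return q_prime, q


q_prime, q = preliminary_interval(2000, 40)
print(q_prime, q)
```

For the stratum of vehicles over 20 cwt. on own account this gives an interval of 40, so that about 50 cards of the stratum are expected in the preliminary sample against the 40 required.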

The cards of the preliminary sample must be withdrawn, sorted by strata
and counted. For each stratum the requisite fraction is then taken or rejected
(at random or by selecting every rth card) so as to obtain the number required
in that stratum in the final sample.

If the register is in the form of a list, the numbers in the different strata
in the preliminary sample must be marked and counted, or copied in a new list,
and the requisite fraction then selected as above or by the method of simultaneous
selection given in Section 3(a) below.

The above procedure is not satisfactory if any of the sampling fractions are 
large, particularly if the corresponding strata are a small fraction of the whole. 
All the members of such strata should be picked out by examination of all 
cards. The remainder can then be dealt with as above. 


(3) Numbers in strata not known, and sorting troublesome or impractical

(a) Simultaneous counting out of strata.

Since the sampling fractions are known, we can keep running counts of all
strata simultaneously, going through the register card by card. Every gth card
of a particular stratum is withdrawn as it is reached. This method requires
considerable care to avoid misclassification in the counting. It is best for two
people to co-operate, one calling out the classification, and the other keeping
the counts.

(b) Complete count as (a) but without selection of the sample, which can then
be selected as in (2).

(c) Count of strata from every rth card only, every g/rth card of a stratum
being selected as in (a).

Thus in the example, if every 10th card is examined, every 40th of those
falling in the two smallest size-group strata will be selected, etc. This is a
case of two-phase sampling.

(d) As (c) but picking out every rth card, sorting into strata, and sampling
as in (1).
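
Method 3(a) is easily expressed as a single pass through the register. The following is a sketch under stated assumptions: the register is represented as a sequence of (card identifier, stratum) pairs, the raising factors g and the starting counts are supplied per stratum, and the function name `count_out` is hypothetical.

```python
def count_out(register, raising_factor, start):
    """Simultaneous counting out of strata in one pass through the register.

    register:       iterable of (card_id, stratum) pairs in register order.
    raising_factor: dict giving g for each stratum (every g-th card is taken).
    start:          dict giving the first count (1..g) to be taken in each stratum.
    Returns the selected card ids and the final running counts of all strata.
    """
    counts = {}
    selected = []
    for card_id, stratum in register:
        counts[stratum] = counts.get(stratum, 0) + 1
        g = raising_factor[stratum]
        first = start[stratum]
        # Withdraw the card when the running count hits first, first + g, ...
        if counts[stratum] >= first and (counts[stratum] - first) % g == 0:
            selected.append(card_id)
    return selected, counts


# Toy register of 12 cards: even ids in stratum "A", odd ids in stratum "B".
register = [(i, "A" if i % 2 == 0 else "B") for i in range(12)]
selected, counts = count_out(register, {"A": 2, "B": 3}, {"A": 1, "B": 2})
print(selected, counts)
```

Method 3(c) can be sketched with the same function by passing only every rth card of the register and replacing each g by g/r, which must then be a whole number.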


In the case of methods 3(c) and 3(d) certain additional errors due to the 
two-phase sampling will be introduced. We therefore require to determine 
the fraction of cards which should be taken at the first phase. 

Suppose that in our illustrative example we have the following effective
variances per vehicle (i.e. the quantities such that the error variance of the mean
of a sample of n is the effective variance/n, the sampling fractions being taken
to be small):

Type of sample                                  Effective variance
Random                                          100
Stratified with variable sampling fraction      10

We shall then have the following error variances with samples of 650 vehicles:

Type of sample                                  Error variance
Random                                          100/650    = 0.15
Stratified with variable sampling fraction      10/650     = 0.015
Two-phase, every 10th card:
    1st phase                                   100/10,000 = 0.010
    2nd phase                                   10/650     = 0.015
    Total                                                    0.025
Two-phase, every 5th card:
    1st phase                                   100/20,000 = 0.005
    2nd phase                                   10/650     = 0.015
    Total                                                    0.020


The fraction included at the first phase should be related to the ratio of the 
amounts of work required to select the sample and to collect the information 
after selection. With every 5th card, information will have to be collected 
from one-third more of the vehicles (0.020/0.015 − 1) than if a complete count
of strata were made. Proper figures for the variances must, of course, be 
obtained from actual data before reaching any conclusions. 
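
The arithmetic of this comparison can be sketched as follows, using the illustrative effective variances of 100 and 10 given above. The function name `two_phase_variance` is hypothetical, and the sampling fractions are taken to be small, as in the text.

```python
def two_phase_variance(register_size, r, sample_size, var_random, var_stratified):
    """Error variance of a two-phase sample examining every r-th card.

    Returns (first-phase component, second-phase component, total).
    """
    first_phase_cards = register_size / r
    first = var_random / first_phase_cards
    second = var_stratified / sample_size
    return first, second, first + second


every_10th = two_phase_variance(100000, 10, 650, 100.0, 10.0)
every_5th = two_phase_variance(100000, 5, 650, 100.0, 10.0)

# Extra vehicles needed to match a complete count of strata (second phase only).
extra_fraction = every_5th[2] / every_5th[1] - 1
print(every_10th, every_5th, extra_fraction)
```

Without intermediate rounding the extra fraction for every 5th card comes to 0.325, in agreement with the "one-third more" obtained from the rounded figures.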

The two-phase method can profitably be used when considerable extra 
work is entailed in obtaining particulars for stratification, such as looking up 
vehicles in a supplementary register.


10.3 A method of compensating for incomplete ascertainment in a
stratified sample with uniform sampling fraction


If information is lacking on certain units of a stratified sample (due for 
example to non-response), the raising factors of the different strata can be
adjusted so as to preserve the correct weighting between the different strata.
As has been pointed out in Section 5.22 this will only properly compensate 
for the missing units in so far as these are similar, in the variates concerned, 
to the remaining units of the stratum in which they fall. If this is the case, 
and if the differences between strata are large and the proportion of missing 
units varies considerably from stratum to stratum, an adjustment of this kind 
may substantially improve the accuracy of the estimates. 

With such an adjustment of the raising factors, however, the simplicity of 
analysis that exists with a sample with uniform sampling fraction will be lost. 
This is a matter of some importance in large-scale surveys with many variates, 
particularly when punched-card machinery is used. 

A simple method of avoiding this difficulty is to reject units at random from 
the strata with the smaller proportions of missing units, the numbers rejected 
being so chosen that after rejection all strata have the same proportion of missing 
units. This, however, will result in loss of information which will be
considerable if a few of the strata have relatively large proportions missing.

An alternative, which does not entail much extra work when punched card 
machinery is used, is to reject units (by the selection of cards at random) from 
the strata with the smaller proportions of missing values, and to duplicate 
cards of units selected at random from the strata with the larger proportions 
of missing values, the numbers rejected and duplicated being so chosen that 
the strata are finally represented in the correct proportions. 

As a simple example, we may consider a stratified sample with four equal
strata, each of 1,000 units. Information is available on 80, 85, 90 and 95 per
cent. of these units in the four strata. If we reject and duplicate so as to bring
all percentages to 90, i.e. 900 units per stratum, we shall require to duplicate
100 cards from the first stratum and 50 from the second, and to reject 50 cards
from the fourth stratum. The mean of the first stratum will then be given by
$$\bar{y}_1 = \{S'(y) + 2 S''(y)\}/900$$
where S′ and S″ indicate summation over the non-duplicated and duplicated
units respectively. Consequently, if the population is large and the variance
per unit in the first stratum is $\sigma_1^2$, the sampling variance of the mean of the first
stratum will be
$$V(\bar{y}_1) = \{700 \sigma_1^2 + 4 \times 100 \sigma_1^2\}/900^2 = 0.00136 \sigma_1^2$$
compared with the variance $\sigma_1^2/800 = 0.00125 \sigma_1^2$ of the unweighted stratum
mean. The variance of the mean of the second stratum can be calculated
similarly. If the variance is the same within each stratum this equals
$0.00123 \sigma_1^2$. The variances of the means of the last two strata will each be
$\sigma_1^2/900$.

The estimate of the mean of the population will be given by 1/4 of the sum
of the means of the strata, and its variance will consequently be 1/16 of the
sum of the variances. This is found to be $0.000301 \sigma_1^2$. The properly weighted
estimate (i.e. the mean of the strata means without rejection or duplication)
has a variance of
$$\frac{1}{16}\left(\frac{1}{800} + \frac{1}{850} + \frac{1}{900} + \frac{1}{950}\right)\sigma_1^2 = 0.000287 \sigma_1^2.$$
The estimate obtained by rejecting units from all strata except the first, so that
800 units are retained in each stratum, has a variance of $\sigma_1^2/3200 = 0.000312 \sigma_1^2$.

Consequently 4.6 per cent. of the information is lost by combined rejection
and duplication, and 8.0 per cent. by rejection only. These losses are additional
to the losses due to missing information.
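
The calculation of these losses can be sketched as follows. The sketch assumes, as in the example, equal strata with the same variance per unit (so that all variances are expressed in units of that variance) and a large population; the function names are hypothetical.

```python
def stratum_mean_variance(available, target):
    """Variance of a stratum mean, in units of the variance per unit, after the
    number of cards has been brought to `target` by rejection or duplication.

    Assumes no more cards are duplicated than are available.
    """
    if available >= target:
        # Random rejection down to `target` cards: an ordinary mean of `target` units.
        return 1.0 / target
    duplicated = target - available   # units whose cards are counted twice
    single = available - duplicated   # units counted once
    return (single + 4 * duplicated) / target ** 2


def population_variance(stratum_variances):
    """Variance of the mean of k equal strata: 1/k^2 of the sum of the variances."""
    k = len(stratum_variances)
    return sum(stratum_variances) / k ** 2


available = [800, 850, 900, 950]
v_weighted = population_variance([1.0 / n for n in available])
v_adjusted = population_variance([stratum_mean_variance(n, 900) for n in available])
v_rejected = population_variance([stratum_mean_variance(n, 800) for n in available])

loss_adjusted = 1 - v_weighted / v_adjusted
loss_rejected = 1 - v_weighted / v_rejected
print(v_weighted, v_adjusted, v_rejected, loss_adjusted, loss_rejected)
```

The variances agree with the figures quoted above. The losses, computed without intermediate rounding, come to about 4.7 and 8.2 per cent.; the quoted 4.6 and 8.0 per cent. follow from the rounded variances. Other levels of rejection can be compared by changing the second argument of `stratum_mean_variance`.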

The loss in actual cases that have to be dealt with can be calculated in the 
above manner. If there is doubt about the most appropriate level of rejection 
the losses with different levels of rejection can be determined and compared. 
It will be found that in general there is least loss of information if rather more 
units are duplicated than are rejected. 


10.4 Optimal allocation for more than one variate 


When an estimate of the population total or mean of a single variate is 
required, the optimal sampling fractions for a stratified random sample with 
variable sampling fraction are given by the equations of Section 8.17 (a). 
If estimates for two or more variates with different within-strata variances are 
required, these equations will give different values of the sampling fractions. 
It is often asked how the sampling fractions should be chosen in such 
circumstances. 

Suppose there are two variates which are denoted by y and y′ with within-strata
variances $\sigma_i^2$ and $\sigma_i'^2$. Three cases arise. Firstly, if sampling fractions
are chosen which are optimal for Y it may be found that Y′ is estimated with
more than the required accuracy. Secondly, Y may be estimated with more
than the required accuracy when the sampling fractions are optimal for Y′.
Thirdly, neither of these conditions may hold. In the first two cases no problem
arises, since we choose the sampling fractions which are optimal for Y or Y′
respectively. In the third case, as mentioned in Section 3.5, sampling fractions
which are sufficiently nearly optimal can often be determined by compromise,
without any exact investigation, but a formal solution is possible on the lines
of Section 8.17.

Without imposing additional conditions we cannot minimise the variance
for a given cost, since two variances are involved. We can, however, minimise
the cost for given values of V(Y) and V(Y′). If these values are denoted by
A and A′ and the cost is C, we have
$$V(Y) = \sum \sigma_i^2 (1 - f_i) N_i^2 / n_i = \sum N_i \sigma_i^2 (g_i - 1) = A \tag{10.4.a}$$
$$V(Y') = \sum \sigma_i'^2 (1 - f_i) N_i^2 / n_i = \sum N_i \sigma_i'^2 (g_i - 1) = A' \tag{10.4.b}$$
$$C = \sum c_i n_i$$
We require to minimise C subject to conditions 10.4.a and 10.4.b. Following
the procedure of Section 8.17 we may add multiples L and L′ of these conditions
and differentiate with respect to each $n_i$ in turn. Equating the differentials to
zero we obtain the equations
$$c_i - L \sigma_i^2 / f_i^2 - L' \sigma_i'^2 / f_i^2 = 0 \tag{10.4.c}$$
Hence
$$f_i = \frac{\sqrt{L \sigma_i^2 + L' \sigma_i'^2}}{\sqrt{c_i}} \tag{10.4.d}$$
Extension to three or more variates merely requires the insertion of additional
terms $L'' \sigma_i''^2$, etc., in the numerator.

Substitution for the $g_i$ $(= 1/f_i)$ in equations 10.4.a and 10.4.b will give
two simultaneous equations for L and L′. Unfortunately these equations
cannot be easily solved. A solution by successive approximation on the lines
of the example below is therefore necessary.

If we put $L = \lambda/K$ and $L' = \lambda'/K$ with $\lambda + \lambda' = 1$, so that $L + L' = 1/K$,
we have the alternative form
$$f_i = \frac{1}{\sqrt{K}} \, \frac{\sqrt{\lambda \sigma_i^2 + \lambda' \sigma_i'^2}}{\sqrt{c_i}}$$
The numerator is now a weighted mean of $\sigma_i^2$ and $\sigma_i'^2$, and the analogy with
equation 8.17.b will be apparent.

Putting $\sqrt{c_i}/\sqrt{\lambda \sigma_i^2 + \lambda' \sigma_i'^2} = g_{i0}$ we have $g_i = g_{i0}\sqrt{K}$, and therefore,
from equations 10.4.a and 10.4.b,
$$\sqrt{K} \sum N_i \sigma_i^2 g_{i0} = V(Y) + \sum N_i \sigma_i^2 \tag{10.4.e}$$
$$\sqrt{K} \sum N_i \sigma_i'^2 g_{i0} = V(Y') + \sum N_i \sigma_i'^2 \tag{10.4.f}$$

For any assumed value of $\lambda$ the value of $\sqrt{K}$ which gives V(Y) = A can
therefore be determined from equation 10.4.e. The value of V(Y′) for this
value of $\sqrt{K}$ can then be found from equation 10.4.f. The value of $\lambda$ which
gives V(Y′) = A′ under these circumstances can thus be determined by
successive approximation, taking different trial values of $\lambda$. This provides an
alternative method of solution. If desired the roles of y and y′ can be
interchanged in this process. With more than two variates this alternative method
of solution is preferable, since one dimension is thereby eliminated.

Equations 10.4.e and 10.4.f are also of use when trial values of L and L′
are taken directly. With such trial values neither V(Y) nor V(Y′) will have
the required value. We can, however, adjust both L and L′ by a factor $\theta$,
so chosen that neither V(Y) nor V(Y′) exceeds its required value. The sampling
fractions will then all be multiplied by a factor $\sqrt{\theta}$. The value of $\sqrt{\theta}$ is given
by equation 10.4.e or 10.4.f (the larger value being taken) with $\sqrt{K}$ replaced
by $1/\sqrt{\theta}$ and $g_{i0}$ by the raising factors calculated from the trial values of L and L′.

If any of the $f_i$ are found to be greater than unity the solution must be
revised, as indicated in Section 8.17 (a).

The problem is also capable of re-formulation in terms of the minimisation
of costs plus losses and in this form has a direct solution. If, following Section
8.18, the expected losses are taken to be a V(Y) and a′ V(Y′) and we minimise
the costs plus losses, we shall obtain the equations 10.4.c with L and L′
replaced by a and a′. If a and a′ are known, these equations can be solved
immediately.

If a and a′ are not known but the ratio between a and a′ can be assumed,
we have $\lambda = a/(a + a')$ and $\lambda' = a'/(a + a')$, and a value of $\sqrt{K}$ can be chosen
which gives the required general level of accuracy, using equations 10.4.e
and 10.4.f. In a survey to determine the amount of unemployment in various
industries stratified by districts, for example, it might be reasonable to attach
equal importance to errors in the unemployment totals proportional to the
square roots of the numbers employed in the industries, in which case a, a′,
etc., would be taken inversely proportional to these square roots. It should
be noted, however, that this does not imply that the actual standard errors
will be in the ratio of these square roots. The ratio obtained will be given by
equations 10.4.e and 10.4.f and may differ substantially from the ratio of
the square roots.
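
The successive approximation on λ by means of equations 10.4.e and 10.4.f can be sketched as a simple bisection. The sketch rests on several assumptions which are not part of the text: the data for three strata are invented and deliberately symmetrical in the two variates (so that the solution is λ = 0.5), all within-strata variances are positive, V(Y′) is taken to increase steadily with λ when V(Y) is held at its required value, and no sampling fraction exceeds unity. The function names are hypothetical.

```python
import math


def base_raising_factors(lam, var1, var2, cost):
    """g_i0 = sqrt(c_i) / sqrt(lam * var1_i + (1 - lam) * var2_i)."""
    return [math.sqrt(c) / math.sqrt(lam * v1 + (1.0 - lam) * v2)
            for v1, v2, c in zip(var1, var2, cost)]


def variance(sizes, variances, g):
    """Sum of N_i * sigma_i^2 * (g_i - 1), as in equations 10.4.a and 10.4.b."""
    return sum(n * v * (gi - 1.0) for n, v, gi in zip(sizes, variances, g))


def factors_for(lam, sizes, var1, var2, cost, target1):
    """Raising factors g_i for a trial lam, sqrt(K) being fixed by equation 10.4.e."""
    g0 = base_raising_factors(lam, var1, var2, cost)
    sqrt_k = (target1 + sum(n * v for n, v in zip(sizes, var1))) / sum(
        n * v * x for n, v, x in zip(sizes, var1, g0))
    return [sqrt_k * x for x in g0]


def solve_lambda(sizes, var1, var2, cost, target1, target2, tol=1e-12):
    """Bisect on lam until V(Y') reaches target2 with V(Y) held at target1."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        g = factors_for(mid, sizes, var1, var2, cost, target1)
        if variance(sizes, var2, g) > target2:
            hi = mid   # too much weight on the first variate
        else:
            lo = mid
    lam = (lo + hi) / 2.0
    return lam, factors_for(lam, sizes, var1, var2, cost, target1)


# Invented data: three strata, unit costs, variances symmetrical in the two variates.
sizes = [1000, 1000, 500]
var1 = [1.0, 4.0, 9.0]
var2 = [4.0, 1.0, 9.0]
cost = [1.0, 1.0, 1.0]
lam, g = solve_lambda(sizes, var1, var2, cost, 100000.0, 100000.0)
print(lam, [1.0 / gi for gi in g])
```

In practice the two single-variate optima should be examined first, since the bisection is needed only in the third of the cases distinguished above, and any sampling fraction found to exceed unity calls for the revision indicated in Section 8.17 (a).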


Example 10.4 


Using the data on Hertfordshire farms described in Section 3.7, etc., and 
the size-groups of Example 6.7, determine the sampling fractions which will 
be optimal for simultaneously estimating the number of farms growing wheat 
and the total wheat acreage, each with a standard error of 5 per cent. The cost 
per farm is to be taken to be the same for all strata, i.e. all ci = 1. 


We require the numbers of farms in each stratum, the proportion of farms
growing wheat in each stratum, and the within-strata variances of the acreages
of wheat of individual farms. Composite estimates from the various tables
given in the preceding examples have been made. (The actual values could,
of course, have been calculated from the original data, but these were not
readily available.) These are shown in Table 10.4.a, where $p_i$ denotes the
proportion of farms growing wheat and $s_i'^2$ represents the within-strata variances
of the wheat acreages. There is no need to describe how the estimates were
obtained.


TABLE 10.4.a—HERTFORDSHIRE FARM DATA

Stratum  Size-group  N_i    p_i    s_i^2    s_i'^2   N_i s_i^2   N_i s_i'^2   s_i       s_i'
1        1-          440    0      0        0        0           0            0         0
2        6-          520    0.04   0.0384   2        20.0        1040         0.19596   1.414
3        21-         360    0.2    0.16     15       57.6        5400         0.4       3.873
4        51-         520    0.6    0.24     160      124.8       83200        0.48990   12.649
5        151-        400    0.8    0.16     650      64          260000       0.4       25.495
6        301-        210    0.9    0.09     1700     18.9        357000       0.3       41.231
7        501-        50     1.0    0        4500     0           225000       0         67.082

Total                2500                            285.3       931640

From the table the estimate of the number of farms growing wheat, denoted
for convenience by Y, is

    $$Y = \sum N_i p_i = 963.8.$$

A standard error of 5 per cent. will therefore be equivalent to a variance of
$48.19^2$ or 2322. The total wheat acreage is 44,676 acres (Section 3.7) and the
corresponding variance is therefore 4,990,000. The values of $s_i^2$ for numbers
of farms are given by the equation

    $$s_i^2 = p_i q_i.$$

These values are tabulated in Table 10.4.a.

As a first step the sampling fractions (or raising factors) which are optimal
for each variate separately should be calculated. Following the method of
Section 8.17 (a) we require to calculate $\sum N_i s_i \sqrt{c_i}$ and $\sum N_i s_i^2$ and the
similar functions for wheat acreage. $N_i s_i^2$ and $N_i s_i'^2$, which are required in
the subsequent calculations, are tabulated in Table 10.4.a, as are $s_i$ and $s_i'$.


We then find

    $$\sum N_i s_i \sqrt{c_i} = 723.65 \qquad \sum N_i s_i^2 = 285.3$$

    $$\sum N_i s_i' \sqrt{c_i} = 30918 \qquad \sum N_i s_i'^2 = 931640$$

Hence from equation 8.17.c we have, for the optimal sampling fractions for
number of farms,

    $$1/\sqrt{K} = 723.65/(2322 + 285.3) = 0.2775$$

    $$L_0 = 1/K = .07701$$

The values of $f_i$ may now be calculated from equation 8.17.b, using the values
for $s_i$ given in Table 10.4.a. These values are shown in Table 10.4.b.


TABLE 10.4.b—OPTIMAL SAMPLING FRACTIONS FOR NUMBER OF FARMS,
FOR ACREAGE, AND FOR BOTH VARIATES SIMULTANEOUSLY

Stratum   No. of farms   Acreage   Both variates
   2         .0544        .0074       .0478
   3         .1110        .0202       .0981
   4         .1359        .0660       .1268
   5         .1110        .1331       .1313
   6         .0832        .2153       .1603
   7         0            .3502       .2324

Similarly for the optimal sampling fractions for acreage

    $$1/\sqrt{K'} = .005221$$

    $$L_0' = .00002726$$

giving the values of $f_i$ for acreage of Table 10.4.b.
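
The two separate optima can be checked mechanically. The following is a minimal sketch in Python, assuming that equation 8.17.b (which lies outside this section) has the form $f_i = s_i/(\sqrt{K}\sqrt{c_i})$ and that equation 8.17.c gives $1/\sqrt{K}$ as $\sum N_i s_i \sqrt{c_i}$ divided by the required variance plus $\sum N_i s_i^2$, as the arithmetic above suggests. The function name `optimal_fractions` and the variable names are illustrative only.

```python
import math

# Stratum data from Table 10.4.a (strata 1 to 7); all costs c_i are 1 in this example.
N = [440, 520, 360, 520, 400, 210, 50]
p = [0.0, 0.04, 0.2, 0.6, 0.8, 0.9, 1.0]        # proportion of farms growing wheat
var_acres = [0, 2, 15, 160, 650, 1700, 4500]    # within-strata variance of wheat acreage


def optimal_fractions(N, variances, target_variance, costs=None):
    """Return (1/sqrt(K), [f_i]) for a single variate.

    Assumes f_i = s_i / (sqrt(K) * sqrt(c_i)) and
    1/sqrt(K) = sum(N_i s_i sqrt(c_i)) / (target variance + sum(N_i s_i^2)).
    """
    costs = costs or [1.0] * len(N)
    s = [math.sqrt(v) for v in variances]
    sum_ns_sqrt_c = sum(n * si * math.sqrt(c) for n, si, c in zip(N, s, costs))
    sum_ns2 = sum(n * v for n, v in zip(N, variances))
    inv_sqrt_k = sum_ns_sqrt_c / (target_variance + sum_ns2)
    fractions = [inv_sqrt_k * si / math.sqrt(c) for si, c in zip(s, costs)]
    return inv_sqrt_k, fractions


var_farms = [pi * (1 - pi) for pi in p]         # s_i^2 = p_i q_i
k_farms, f_farms = optimal_fractions(N, var_farms, 2322)
k_acres, f_acres = optimal_fractions(N, var_acres, 4_990_000)

for stratum, (ff, fa) in enumerate(zip(f_farms, f_acres), start=1):
    print(stratum, round(ff, 4), round(fa, 4))
```

The printed fractions should agree with the first two columns of Table 10.4.b to within rounding, since the text works from values of $s_i$ and $1/\sqrt{K}$ already rounded.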

It is clear that neither of these two sets of sampling fractions gives the
requisite accuracy in the other variate. Something intermediate is therefore
required. As a first trial (a) we may take L = .04 and L’ = .000013, i.e.
roughly half $L_0$ and $L_0'$. The intermediate calculations for V(Y) and V(Y’)
are given in Table 10.4.c. The values for $g_i$ can be calculated on the slide
rule, using equation 10.4.c.


TABLE 10.4.c—VALUES OF $f_i$ AND $g_i$ FOR L = .04, L’ = .000013

Stratum   $L s_i^2 + L' s_i'^2$    $f_i$      $g_i$
   2           .001562             .0395     25.317
   3           .006595             .0812     12.315
   4           .011680             .1081      9.251
   5           .014850             .1219      8.203
   6           .025700             .1603      6.238
   7           .058500             .2419      4.134

Total          .118887


From equations 10.4.a and 10.4.b we then find

    V(Y) = 2728
    V(Y’) = 5220000

Both of these values are too high. Further trial values: (b) L = .06,
L’ = .000013 and (c) L = .06, L’ = .000010 were therefore taken. The
values of V(Y) and V(Y’) obtained were

    (b) L = .06, L’ = .000013        (c) L = .06, L’ = .000010
        V(Y) = 2274                      V(Y) = 2332
        V(Y’) = 4812000                  V(Y’) = 5301000

If three points A, B and C with coordinates corresponding to the values
(a), (b) and (c) of L and L’ are plotted, the points P and Q on the lines AB and
BC where V(Y) is estimated to have the required value 2322 can be determined
by linear interpolation. The line PQ then represents an approximation to the
curve of values of L and L’ for which V(Y) has the required value. A similar
line can be drawn for V(Y’). These two lines are found to intersect at the point
L = .059, L’ = .0000120. A check computation of the values of V(Y) and
V(Y’) for these values of L and L’ gives

    V(Y) = 2310        V(Y’) = 4972000

Hence L and L’ have been determined with all necessary accuracy. The
resultant optimal values of $f_i$ are included in Table 10.4.b.
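
The trial computations lend themselves to a short program. The sketch below is hedged in two respects: it assumes, from the arithmetic of Table 10.4.c, that with all $c_i = 1$ the sampling fractions are $f_i = \sqrt{L s_i^2 + L' s_i'^2}$ with $g_i = 1/f_i$, and it assumes that equations 10.4.a and 10.4.b (which lie outside this passage) are the usual stratified-sampling variances $\sum N_i s_i^2 (g_i - 1)$. The name `trial` is illustrative only.

```python
import math

# Strata 2 to 7 of Table 10.4.a; stratum 1 has zero variance for both variates.
N = [520, 360, 520, 400, 210, 50]
s2_farms = [0.0384, 0.16, 0.24, 0.16, 0.09, 0.0]   # s_i^2 = p_i q_i
s2_acres = [2, 15, 160, 650, 1700, 4500]           # s_i'^2


def trial(L, L_prime):
    """Return (fractions, V(Y), V(Y'), expected sample size) for trial L and L'."""
    f = [math.sqrt(L * a + L_prime * b) for a, b in zip(s2_farms, s2_acres)]
    g = [1.0 / fi for fi in f]                     # raising factors
    v_farms = sum(n * a * (gi - 1) for n, a, gi in zip(N, s2_farms, g))
    v_acres = sum(n * b * (gi - 1) for n, b, gi in zip(N, s2_acres, g))
    sample_size = sum(n * fi for n, fi in zip(N, f))
    return f, v_farms, v_acres, sample_size


for label, (L, Lp) in {"a": (0.04, 0.000013),
                       "b": (0.06, 0.000013),
                       "c": (0.06, 0.000010),
                       "check": (0.059, 0.000012)}.items():
    f, vy, vyp, size = trial(L, Lp)
    print(label, round(vy), round(vyp), round(size, 1))
```

The results differ slightly in the last figure from those in the text, which were obtained from rounded slide-rule values. The printed sample sizes are those before the final adjustment described below.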

The alternative method of solution using λ and λ’ gives similar results and
is left as an exercise to the reader.

It is worth noting that all three sets of trial values give sampling fractions 
which, after adjustment so that neither variance exceeds its required value, 
are reasonably efficient. In order to compare the efficiencies the sampling 
fractions given by the chosen values of L and L’ must be adjusted by multipli- 
cation by 4/0, calculated as explained above. After these adjustments the 
numbers of farms in the samples and the relative efficiencies are found to be: 


             L’/L      No. of farms   Relative efficiency
(a)         .000325       231.7              96.4
(b)         .000217       223.9              99.8
(c)         .000167       230.5              96.9
Optimal     .000203       223.4             100


10.5 The double-ratio estimate 

The ratio method is applicable to populations in which the ratio of two
variates, y and x, is less subject to variation than either variate separately.
The method can be extended to the case in which there are four variates $y_1$, $x_1$,
$y_2$, $x_2$ such that the ratio of the ratios $y_1/x_1$ and $y_2/x_2$ is less subject to variation
than either ratio separately. This extension is due to N. Keyfitz * and
has been applied by him to the estimation of the total labour force, wages,
salaries, materials used, etc., in the case in which there is an initial complete
census of production (denoted here by $x_1$), and of the labour force (denoted
by $y_1$), etc., and subsequently a further complete census of production, $x_2$,
but a sample only for labour force, $y_2$, etc. In this case it may be anticipated
that the production per worker in a factory, though it may vary from factory to
factory, and from period to period in the same factory, will increase or decrease
in much the same ratio from period to period for all factories.

The required estimate of the total labour force on the second occasion is,
in the case of a random sample,

    [equation not legible in the source]

The variance of $Y_2$ is given by the approximate formula

    [equation and accompanying definitions not legible in the source]

The parallelism with the formulæ for the single-ratio estimate already given
will be apparent. The extension to the case of stratified sampling is similar
to that followed in the case of the single-ratio estimate.

In the case of the labour-force estimate Keyfitz found that the squares of
the coefficients of variation, i.e. (S.E. of estimate/estimate)², for the double-ratio
estimate and the two possible single-ratio estimates were as follows:

    Double-ratio estimate . . . . . . . . . . . . . 0.0012
    Single-ratio estimate based on $X_2$ . . . . . . 0.040
    Single-ratio estimate based on $Y_1$ . . . . . . 0.303

The double-ratio estimate is therefore in this case much more accurate than
either of the single-ratio estimates.

* I am indebted to N. Keyfitz for permission to publish an account of this method,
which he presented in a lecture given at the London School of Economics.


10.6 Relative precision of biased and unbiased estimates


The use of biased estimates introduces errors (due to the bias) which may
be large relative to the random sampling errors. Nevertheless we often have
good grounds for believing, either on the basis of previous experience of the
material under consideration, or from detailed statistical analysis, that the
errors due to bias are actually small, particularly when comparisons between
different domains of study are required. If, therefore, a large reduction in the
random sampling error is effected by the use of a biased estimate this may be
preferable to the unbiased estimate. The relative precision of biased and
unbiased estimates (i.e. the reciprocal of the ratio of their respective sampling
variances) can be determined by estimating the variances of the two types of
estimate by the methods appropriate to the estimates in question. In the
particular but important case in which the unbiased estimate consists of a
weighted mean of the observed values z with weights w, and the sources of
variation are such that all z can be regarded as subject to the same variance,
the unweighted mean of the values will provide an estimate of the mean which
has minimum variance. The ratio of the variances of the two estimates will
then be, from formula 7.5.e,

    $$\frac{1}{n} \bigg/ \frac{S(w^2)}{\{S(w)\}^2} = \frac{(\text{mean of } w)^2}{\text{mean of } w^2}$$


Example 10.6

An estimate of rabbit damage to the wheat crop of a county is made on a
random sample of farms. On each selected farm one of the fields growing
wheat is selected with probability proportional to the area of the field. The
damage is estimated by comparing fenced and unfenced areas. If the distribution
of wheat acreages is that given in Table 7.2 what is the relative precision
of the weighted and unweighted estimates?

In this case the components of variation are such that the unweighted
mean of the losses per acre on all sampled fields is likely to provide the estimate
with approximately minimum error variance. This estimate will, however,
be biased if the loss per acre is correlated with the area of wheat on the farm.
An unbiased estimate will be provided by a weighted mean of the losses per
acre, the weights being proportional to the areas of wheat on the sampled
farms. The farms not growing wheat must be excluded from the calculation,
since they are automatically excluded from both the weighted and unweighted
means, even if they are included in the original sample. From the results
given in Example 7.2.b we have

    n’ = 45,   S(y) = 2301,   S($y^2$) = 207,261

where n’ is the number of farms growing wheat. Here w = y. Hence the
relative precision is

    $$\frac{1}{45} \bigg/ \frac{207{,}261}{(2301)^2} = 0.568$$

The biased estimate has therefore nearly twice the precision of the unbiased
estimate. If it were possible to select farms with probability proportional to
area of wheat grown (one field, selected with probability proportional to area,
being sampled on each selected farm), the unweighted mean would provide
the unbiased estimate. The reciprocal of the above fraction, 1.76, therefore
gives the advantage, when an unbiased estimate is required, of using this
method of sampling, i.e. of giving each acre an equal chance of being selected
for assessment of loss instead of taking a random sample of farms. A method
of selecting a sample which approximates to this requirement is described in
Section 10.8. It may be noted here that the relative precision of the two
estimates is very similar to that found in Example 7.17 for the rather similar
case of the rate of application of fertilizers. In that case the value was
1/1.91 = 0.524.
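
The arithmetic of this example is easily reproduced. The following Python sketch evaluates the ratio (mean of w)²/(mean of w²) from the three totals quoted in the example; the function name `relative_precision` is an illustrative choice, not one taken from the text.

```python
def relative_precision(n, sum_w, sum_w2):
    """Variance of the unweighted mean divided by that of the weighted mean.

    Equals (mean of w)^2 / (mean of w^2), on the assumption made in the text
    that all observed values are subject to the same variance.
    """
    mean_w = sum_w / n
    mean_w2 = sum_w2 / n
    return mean_w ** 2 / mean_w2


# Totals quoted in Example 10.6 (here the weights w are the wheat acreages y).
precision = relative_precision(n=45, sum_w=2301, sum_w2=207_261)
gain = 1 / precision   # advantage of giving each acre an equal chance of selection

print(round(precision, 3), round(gain, 2))
```

Since the ratio depends only on the spread of the weights, it can be computed before any field work is done, provided the approximate distribution of the weights is known.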


10.7 The sampling error of an estimate of bias 


In Section 7.23 it was pointed out that an estimate of the bias arising from
the use of a biased method of estimation could be obtained by comparing the
biased with the unbiased estimates for relevant subdivisions of the sample.
In the common case in which the biased estimate under consideration is the
unweighted mean of some variate z, whereas the unbiased estimate is provided
by some form of weighted mean of z, an alternative approach to the problem
is provided by the calculation of the regression of z on the weights w. If B
denotes the estimated bias and b the regression coefficient of z on w we have

    $$B = \bar{z} - \bar{z}_w = \frac{Sz}{n} - \frac{Swz}{Sw} = -\frac{S(w - \bar{w})(z - \bar{z})}{Sw}$$

    $$b = \frac{S(w - \bar{w})(z - \bar{z})}{S(w - \bar{w})^2}$$

Hence

    $$B = -\frac{b\,S(w - \bar{w})^2}{Sw}$$

Thus for a given distribution of weights the magnitude of the bias is proportional
to b. In particular, if the value of b for the population is zero there
will be no bias, and if b is the same for different domains of study which have
similar distributions of the weights the comparisons between the unweighted
means for the various domains will be free from bias. The contribution to the
sampling error of B due to error in estimation of b can be obtained by calculating
the error variance of b from the formula given in Section 7.12. There will be
a further contribution due to variation in the distribution of w from sample
to sample, but this can ordinarily be neglected.

It should be noted that the calculation of b does not give a more accurate
estimate of the bias of the unweighted mean than that provided by the difference
of the weighted and unweighted means. The two estimates are in fact identical.
The calculation of the standard error of B does, however, provide an indication
of the probable limits of the bias.


Example 10.7

Use the above method to assess the evidence for bias in the estimate of
the dressing of nitrogen per acre derived from the unweighted mean over all
fields of Example 6.19.

In the notation of Example 6.19, w = g’ g” x and z = r = y/x. The
quantities Sw, Sz, S$w^2$, Swz, S$z^2$ can best be calculated for each size-group
separately, before introduction (where necessary) of the factors g’ and g”.
Sw and Swz have already been given in the formula for $\bar{r}$. Sz is given in Table
6.19.c. The results for the whole sample are:

    n = 67        Sw = 58,229        Sz = 29.36

    $S(w - \bar{w})^2 = 44{,}849{,}800$    $S(w - \bar{w})(z - \bar{z}) = 2996$    $S(z - \bar{z})^2 = 3.4822$

    b = 2996/44,849,800 = + 0.000,066,80

    B = − 0.000,066,80 × 44,849,800/58,229 = − 0.0515

This agrees with the difference 0.438 − 0.490 = − 0.052 of the unweighted
and weighted means. Following the method given in Section 7.12 and illustrated
in Example 7.12.a we find

    S.E. (b) = ± 0.000,033,6

Hence

    S.E. (B) = ± 0.0258

The actual value of B is almost double its standard error. There is therefore
some evidence that the unweighted mean is biased, but the magnitude of the
bias is not accurately determined. The limits of error given by plus or minus
twice the standard error are + 0.0001 and − 0.1031.

It should be noted that the amount of bias obtained in sampling a population
depends not only on the properties of the population, but also on the sampling
method followed. Changes in the sampling method will consequently affect the
bias. In this example, for instance, the weights are affected by the first-stage
sampling fractions. If sufficiently extensive data were available, the biases to
be expected with different sampling fractions could be estimated by adjusting
the contributions of the different strata to Sw, Sz, S$w^2$, Swz, S$z^2$ so as to
conform to the new sampling fractions.
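
The steps of Example 10.7 may be sketched in a few lines of Python. The sketch takes the sums of squares and products and the standard error of b as given (the latter comes from Section 7.12 and is not recomputed here); the function name `bias_from_regression` is an illustrative assumption.

```python
def bias_from_regression(sum_w, ss_w, sp_wz, se_b):
    """Estimate the bias B of the unweighted mean and its standard error.

    sum_w : Sw, the total of the weights
    ss_w  : S(w - w_bar)^2
    sp_wz : S(w - w_bar)(z - z_bar)
    se_b  : standard error of the regression coefficient b (supplied, not derived)
    """
    b = sp_wz / ss_w                  # regression coefficient of z on w
    scale = ss_w / sum_w              # factor converting b (and its error) to B
    bias = -b * scale
    se_bias = se_b * scale
    limits = (bias - 2 * se_bias, bias + 2 * se_bias)
    return b, bias, se_bias, limits


b, B, se_B, limits = bias_from_regression(
    sum_w=58_229, ss_w=44_849_800, sp_wz=2996, se_b=0.0000336
)
print(b, round(B, 4), round(se_B, 4), [round(x, 4) for x in limits])
```

Because B is a fixed multiple of b for a given set of weights, the ratio of B to its standard error is the same as that of b to its standard error, namely a little under two in this example.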


10.8 A simple method of sampling with probability proportional to 
size 

The standard method of selecting units with probability proportional to
size is to form the running totals of the sizes of the successive units, i.e. the totals
$x_1$, $x_1 + x_2$, $x_1 + x_2 + x_3$, ..., and then to select the required number of
numbers at random in the range 1 to X, where X is the total of all the x’s.
If a number which is greater than $x_1$ and less than or equal to $x_1 + x_2$ is selected,
for example, then the second unit is chosen.
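
The running-total method can be sketched as follows in Python. The sketch assumes integer sizes and selection with replacement (the text does not say how repeated selections of the same unit are to be treated); the function name `select_pps_cumulative` is illustrative.

```python
import bisect
import itertools
import random


def select_pps_cumulative(sizes, k, rng=random):
    """Select k unit indices (0-based) with probability proportional to size.

    Forms the running totals x1, x1+x2, ... and, for each draw, takes a random
    number r in 1..X and chooses the first unit whose running total is >= r.
    Selection is with replacement (an assumption of this sketch).
    """
    running = list(itertools.accumulate(sizes))
    total = running[-1]
    chosen = []
    for _ in range(k):
        r = rng.randint(1, total)                      # 1 <= r <= X inclusive
        chosen.append(bisect.bisect_left(running, r))  # first running total >= r
    return chosen


sample = select_pps_cumulative([3, 5, 2, 10], k=3, rng=random.Random(0))
print(sample)
```

The binary search replaces the clerk's scan down the column of running totals; the labour of forming that column, which is the objection raised in the next paragraph, remains.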

If the number of units from which selection has to be made is large the 
calculation of these running totals requires a considerable amount of labour, 
particularly if a printer adder is not available. In some circumstances, indeed, 
this labour is sufficiently great, relative to the other work of the survey, to rule 
out entirely the use of sampling with probability proportional to size. 

A simple and ingenious method of overcoming this difficulty has been
devised by D. B. Lahiri. If there are N units, and M is a number greater than
or equal to the largest x (the x’s being expressed in suitable units), then for
each unit that has to be selected two numbers u and v are selected at random
from the ranges 1 to M for u and 1 to N for v. If the size of the v’th unit is
greater than or equal to u the unit is selected. If it is less than u the unit is
rejected and a further pair of random numbers is chosen, the process being
repeated until a unit satisfying the condition is obtained.
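
Lahiri's procedure translates directly into a short rejection loop. The Python sketch below follows the description just given, assuming integer sizes of which at least one is positive; the function name `lahiri_select` and the returned rejection count are illustrative additions.

```python
import random


def lahiri_select(sizes, M=None, rng=random):
    """Draw one unit (0-based index) with probability proportional to size.

    A unit number v in 1..N and a number u in 1..M are drawn at random; the
    unit is accepted if its size is >= u, otherwise a fresh pair is drawn.
    Returns the index and the number of rejected pairs.
    """
    largest = max(sizes)
    if M is None:
        M = largest
    if M < largest:
        raise ValueError("M must be at least as large as the largest size")
    n_units = len(sizes)
    rejections = 0
    while True:
        v = rng.randint(1, n_units)
        u = rng.randint(1, M)
        if sizes[v - 1] >= u:
            return v - 1, rejections
        rejections += 1


rng = random.Random(1)
index, rejected = lahiri_select([5, 15, 25, 80], M=80, rng=rng)
print(index, rejected)
```

No running totals are formed: only the size of the unit actually drawn is consulted, which is the whole economy of the method.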

There is no need to determine the largest x exactly, though if an excessively
high value of M is taken this will lead to an unnecessarily large number of
rejections. If there are relatively few large values of x the number of rejections
will in any case be large; but there is no way of avoiding this (other than the
formation of the cumulative totals) unless the large values can be picked out
and segregated in one or more separate strata. When the size range is large
such stratification is in any case frequently advisable for other reasons.
In two-stage sampling in which the second-stage units require to be selected
with equal probability the difficulty of a large size-range can often be overcome
by selecting two or more second-stage units from very large first-stage units.


Example 10.8

Devise a method of selecting a sample of wheat fields with probability
approximately proportional to size which can be used in the rabbit damage
survey discussed in Example 10.6.

If no records of the acreages of the crops are available farm by farm at the
time the crops are sown, sampling with probability exactly proportional to area
of crop is impossible. The acreages sown in the preceding year are, however,
given in the June 4th returns, and the changes from year to year in the more
important crops on a farm are likely to be small. If, therefore, farms are selected
with probability proportional to acreage of wheat in the preceding year, and from
each selected farm one field of wheat is selected from all the fields growing
wheat with probability proportional to the acreage of these fields, the process
of selection will give selection with probability proportional to the areas of the
wheat fields in the district covered, except in so far as the areas of wheat grown
on the farms in the year of the survey differ from those grown on the same farms
in the previous year.

In a survey of rabbit damage covering a large area it would be impracticable
to form running totals of the wheat acreages of all the farms. Lahiri’s method,
therefore, provides a useful alternative. Owing to the relatively small number
of farms growing a large area of wheat the number of rejections will be somewhat
large, even if more than one field is taken from the farms with very large areas
of wheat. The number of rejections may be estimated if the approximate
distribution of the wheat acreages is known. If the distribution of wheat acreages
is that given in Table 7.2, for example, and the value of M chosen is 80, the
chance of retaining a selected farm with acreage of wheat x (x < 80) is x/80.
If all farms in the second size-group had 5 acres of wheat, all those in the third
15 acres, etc., the average number retained when all the farms of the table are
included in the selection process will be

    [sum over the size-groups of Table 7.2 not legible in the source] = 25.25

Since, however, farms with no wheat in the previous year may carry wheat
in the current year some of these should also be selected. A reasonable rule
will be to treat all farms which carry less than 5 acres of wheat as if they were
carrying 5 acres. The chance of retaining a selected zero farm will then be
1/16, so that the number retained will be approximately 30 out of 125 or about
24 per cent. Not all of these will be found to carry wheat in the current year.

If x > 80 the farm will always be retained. If x lies between 80 and 160
one field will be taken if x − 80 < u, and two fields if x − 80 ≥ u. Similarly
two or three fields will be taken if x lies between 160 and 240, and so on. The
expected number of farms on which two or more fields will be retained can be
calculated as above. The expected number with two or more fields will be

    [sum not legible in the source]

and of these the one with 260 acres will have 3 or 4 fields.

For the same number of sampled fields the selection of the fields with
probability approximately proportional to area may be expected to give a gain
in precision of the same order as that found for the unweighted estimate of
Example 10.6. The small differences in weighting which will result from
lack of knowledge of the current year’s wheat acreages are not likely to increase
the variance appreciably. Indeed these weights are not likely to be sufficiently
associated with the amount of damage to make the unweighted mean appreciably
biased. On the other hand, if a random selection of farms with equal
probability were made, the use of an unweighted mean might give a seriously
biased estimate of damage since the degree of damage may be associated with
size of farm.

The relative precision calculated in Example 10.6 does not take into account
the sampling of the farms that grew no wheat in the preceding year. An
accurate estimate of the relative precision can be made when results are available
from an actual survey, or from data which give the areas of wheat on the same
farms in two consecutive years.

10.9 Estimation of error in two-stage sampling with uniform overall 
sampling fraction and selection at the first stage from within 
strata with probability proportional to number of second- 
stage units 

This important special case has been discussed in Sections 3.10 and 8.13.
The formulæ for the variance given in Section 8.13 can be stated in a somewhat
more elegant form (see Gray and Corlett, 1950, D’). Instead of considering
the variability of the values of $r_{ij}$ between the different first-stage units of a
stratum we may consider the variability of the corresponding totals of y for
the different first-stage units. If the sample total of y for the jth unit (presumed
selected) of the ith stratum is denoted by $Y_{ij}$ (= $S_{ij}(y)$), the estimate of the
variance of these totals within stratum i will be given by

    $$s_{ti}^2 = \frac{1}{n_i' - 1}\, S_j (Y_{ij} - \bar{Y}_i)^2$$

Since there will usually only be very few (often two) first-stage units per stratum,
some form of pooled estimate will be required, derived from the estimates of
the various strata. This can be denoted by $s_t^2$. The formula for the variance
then becomes

    [equation not legible in the source]

where $n_i$ is the number of selected second-stage units in the ith stratum.
This formula is similar in form to that for V(Y) in a two-stage sample
with equal probability of selection at the first stage. This latter formula can
be easily derived from the formula for V(Y) given in Section 7.17.

If the number of second-stage units in each first-stage unit is taken as the
measure of size, the same number, n” (= f $N_i/n_i'$), of second-stage units will
require to be selected from each first-stage unit, and $r_{ij} = \bar{y}_{ij}$. In this case
we can of course work with either the totals $Y_{ij}$ or the means $\bar{y}_{ij}$. If, however,
the preliminary estimates of the numbers of second-stage units in each
first-stage unit are not exact, as for example may
occur when the second-stage frame for the selected first-stage units is constructed
after selection of the first-stage units, the formula given above, based on totals,
still holds. An actual case in which this contingency occurred is described in
Section 4.16.


10.10 Relative efficiency of sampling with probability proportional 
to size and stratified sampling with variable sampling 
fraction 

The relative efficiency of sampling with uniform probability and with 
probability proportional to size of unit has been discussed in Section 8.9. A 
further question that arises is how sampling with probability proportional to 
size compares in efficiency with stratification by size and the use of a variable 
sampling fraction. 

For any particular type of material this question can be dealt with by 
comparing the variances of the two types of sample, calculated by the methods 
already described in Chapter 8. It is worth noting, however, that the analysis 
of variance procedure provides a rapid and elegant approximate method of 
making this comparison. 

If we have a random sample taken with probability proportional to size x
and we stratify this sample (after selection) into size-groups of x, we can perform
an analysis of variance between and within size-groups on the values of r
obtained from the sample. This can be arranged as in Table 10.10.a, where
$V_w(r)$ represents the pooled estimate of the variance of r within size-groups.


TABLE 10.10.a—ANALYSIS OF VARIANCE OF r

                         Degrees of freedom     Mean square
Between size-groups           t − 1
Within size-groups         $\sum (n_i - 1)$       $V_w(r)$
Total                         n − 1               V(r)


If the ranges of the size-groups are sufficiently small for the variation in
size of x within size-groups to be neglected, the sum of squares of r within
size-groups can be taken as equal to $\sum (n_i - 1) s_i^2/\bar{x}_i^2$. If in addition there is no
great variation in the $V_i(r)$ for the different size-groups, or if all $n_i$ are large,
we have approximately

    $$V_w(r) = \frac{\sum (n_i - 1)\, s_i^2/\bar{x}_i^2}{n - t} = \frac{\sum n_i\, s_i^2/\bar{x}_i^2}{\sum n_i} \qquad (10.10)$$

If the number of units in the ith stratum of the stratified sample is taken
as $n_i = n X_i/X$ the total number of units in the two samples will be equal,
and the sampling fractions will be proportional to $\bar{x}_i$, and therefore about
optimal. In this case $f_i = n_i/N_i = n \bar{x}_i/X$, and from formula 7.6.a

    $$V(Y) = N^2\, V(\bar{y}) = \frac{X^2}{n^2} \sum \frac{n_i\, s_i^2\, (1 - f_i)}{\bar{x}_i^2}$$

From equation 10.10 this approximates to $X^2 V_w(r)/n$, apart from the factors
$(1 - f_i)$.

For the sample with probability proportional to size $V(Y) = X^2 V(r)/n$
(Section 7.15). Consequently the ratio of the mean squares $V(r)/V_w(r)$ in the
analysis of variance gives an estimate of the efficiency of stratified sampling
relative to sampling with probability proportional to size. This estimate is
approximate because (a) it has been assumed that the variation in x within a
size-group can be neglected, (b) corrections for finite sampling, $1 - f_i$, have
been omitted, and (c) the $n_i - 1$ have been replaced by $n_i$. Allowance for
(a) will cause a decrease in the estimate, allowance for (b) will cause an increase,
while (c) is not likely to be important. Some further gain may be expected in
stratified sampling by taking optimal sampling fractions instead of fractions
proportional to mean size.

The inverse of the process can also be used, starting with the data of a
stratified sample with variable or uniform sampling fraction, or with the data
of a random sample stratified after selection. This process is illustrated in the
example below. Apart from saving computation it has the merit that reference
to the original data for the calculation of values of r (which would be necessary
if the formula for $s_r^2$ of Section 8.9 were used) is not necessary.


Example 10.10

Using the data on Hertfordshire farms described in Section 3.7, etc., and
the within-size-group variances of Table 10.4.a, compare the relative efficiency
of sampling from within size-groups with sampling fractions proportional to
the mean farm acreages of the size-groups, and unstratified sampling with
probability proportional to farm acreage.

As before, we may take x to represent farm acreage (acres crops and grass),
and y to represent wheat acreage. We shall require rough estimates of $\bar{x}_i$
and $\bar{y}_i$ for all size-groups. For $\bar{x}_i$ the size-group means were assumed to be
situated at one-third the group interval from the lower limit of the group.
For $\bar{y}_i$ weighted means were calculated from Tables 6.5.b, 6.6.b and 6.7.b.
The values obtained are shown in Table 10.10.b. Taking sampling fractions
of $\bar{x}_i/1000$ we obtain from Table 10.4.a the values of $n_i$ shown. The rest
of the calculations in the table are self-explanatory ($\bar{r} = \sum n_i \bar{r}_i/\sum n_i$).


TABLE 10.10.b—HERTFORDSHIRE FARM DATA: CALCULATIONS FOR THE
CONSTRUCTION OF AN ANALYSIS OF VARIANCE OF r

Size-group  $\bar{x}_i$  $\bar{y}_i$  $\bar{r}_i$  $s_i^2$  $s_i^2/\bar{x}_i^2$  $n_i$   $n_i \bar{r}_i$
   (1)        (2)     (3)     (4)     (5)      (6)       (7)      (8)
   6-          10      0.2   .02        2     .02        5.2      .104
  21-          30      1.7   .0567     15     .01667    10.8      .6124
  51-          85     11.0   .1294    160     .02215    44.2     5.7195
 151-         200     31     .155     650     .01625    80      12.4
 301-         365     77     .2110   1700     .01276    76.6    16.1626
 501-         600    172     .2867   4500     .0125     30       8.6010

Total                                                  246.8    43.5995


From columns 4 and 8 we find for the sum of squares of r between size-groups

    $$\sum n_i \bar{r}_i^2 - \bar{r} \sum n_i \bar{r}_i = 0.8728$$

and from columns 6 and 7 we find for the estimated sum of squares of r within
size-groups

    $$\sum (n_i - 1)\, s_i^2/\bar{x}_i^2 = 3.8152.$$

The reconstructed analysis of variance of r therefore takes the form shown
in Table 10.10.c.


TABLE 10.10.c—RECONSTRUCTED ANALYSIS OF VARIANCE OF r WHERE
SAMPLING IS WITH PROBABILITY PROPORTIONAL TO SIZE

                         Degrees of     Sum of      Mean
                          freedom       squares     square
Between size-groups          5          0.8728
Within size-groups         240.8        3.8152      .01584
Total                      245.8        4.6880      .01907


The approximate relative efficiency of the two methods of sampling is
therefore .01907/.01584 = 1.20. Owing to variation of x within size-groups
this is a slight overestimate when all sampling fractions are small, as explained
above. The correction to the total mean square from this cause is in fact of
the order of − .0015. Unless all sampling fractions are small, however,
corrections for finite sampling are required. This will increase the efficiency
of the stratified sampling, as will adjustment of the sampling fractions to their
optimal values.
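
The reconstruction of the analysis of variance can be reproduced as follows. This Python sketch starts from the group means, the within-group variances of wheat acreage and the stratum numbers $N_i$ of Table 10.4.a, and rebuilds columns 4 to 8 of Table 10.10.b; the mean wheat acreage of 11.0 for the 51- group is inferred here from the tabulated ratio .1294 and mean size 85. All variable names are illustrative.

```python
# Size-groups 6-, 21-, 51-, 151-, 301-, 501- (the 1- group grows no wheat).
x_bar = [10, 30, 85, 200, 365, 600]          # assumed mean farm acreage
y_bar = [0.2, 1.7, 11.0, 31, 77, 172]        # mean wheat acreage
s2 = [2, 15, 160, 650, 1700, 4500]           # within-group variance of wheat acreage
N = [520, 360, 520, 400, 210, 50]            # farms per group (Table 10.4.a)

n = [Ni * xb / 1000 for Ni, xb in zip(N, x_bar)]   # sampling fractions x_bar_i / 1000
r = [yb / xb for yb, xb in zip(y_bar, x_bar)]      # group ratios r_i
n_total = sum(n)
sum_nr = sum(ni * ri for ni, ri in zip(n, r))
r_mean = sum_nr / n_total

# Between-groups sum of squares: sum(n_i r_i^2) - r_mean * sum(n_i r_i).
ss_between = sum(ni * ri ** 2 for ni, ri in zip(n, r)) - r_mean * sum_nr
# Within-groups sum of squares, neglecting variation of x inside a group.
ss_within = sum((ni - 1) * v / xb ** 2 for ni, v, xb in zip(n, s2, x_bar))

ms_within = ss_within / (n_total - len(n))
ms_total = (ss_between + ss_within) / (n_total - 1)
efficiency = ms_total / ms_within

print(round(ss_between, 4), round(ss_within, 4), round(efficiency, 2))
```

The figures differ from those of Tables 10.10.b and 10.10.c only in the third or fourth decimal place, the tabulated values having been rounded at each step.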


10.11 Choice of probability function in sampling with variable
probability 

Sampling with probability proportional to size is one form of what may be 
termed sampling with variable probability. The essence of such sampling is 
that the probability of selection of the different individual units is made pro- 
portional to some known quantitative characteristic of the units themselves. 
The same general theory will apply whatever the measure adopted. In two- 
stage sampling the number of second-stage units often provides a convenient 
measure of the size of the first-stage units (Section 8.13), but in some cases 
greater efficiency will be obtained if the probabilities are taken proportional 
to some function of the number of second-stage units, instead of proportional 
to the actual number. 

This question has been discussed by Hansen and Hurwitz (1949, A’).
They consider the case of two-stage sampling with uniform overall probability
(self-weighting) and a cost function of the form

    $$C = c_1 n_1 + c_2 n_2 + c_3 n_3$$

where $n_1$ is the number of selected first-stage (primary) units, $n_2$ is the expected
number of second-stage units included in the selected first-stage units, and $n_3$
is the number of second-stage units included in the sample. The first and
third terms of this cost function correspond to the terms of the cost function
of Section 8.17 (d). The second term arises when a frame has to be constructed
for each selected first-stage unit, e.g. by listing or ground survey, and
there are other costs, e.g. of travel, which are proportional to the total number
of second-stage units in the selected first-stage units. Taking an illustrative
example from the 1940 Census of Housing, Hansen and Hurwitz tabulate the
relative efficiencies of selecting first-stage units with (a) equal probability,
(b) probability proportional to the square root of the number of second-stage
units, and (c) probability proportional to the number of second-stage units,
for various ratios between $c_1$, $c_2$ and $c_3$. For most of the chosen ratios, selection
with probability proportional to the square root of the number of second-stage
units is most efficient. Even when $c_2$ is zero, method (b) is only slightly less
efficient than method (c).


10.12 A simple relation between relative precision and relative efficiency 

It is worth noting that when two sampling methods are being compared 
and one of them is random or stratified with uniform sampling 
fraction, there is a simple relation between the relative precision and relative 
efficiency of the two methods. Denote the two methods by $S_1$ and $S_2$, $S_2$ 
being random or stratified with uniform sampling fraction. Let the relative 
precision ($S_2/S_1$) of the two methods for a given sample size be P, and the 
relative efficiency when $S_1$ is of the given sample size be E. Then the relation 
in question is 

$E = f + (1 - f)P$

where f is the overall or average sampling fraction of $S_1$, i.e. the number of 
units in the sample divided by the number in the population. From this relation 
the relative efficiency can be obtained immediately from the relative precision 
and vice versa. The specification of the size of $S_1$ is necessary because both P 
and E will vary with variations in this size. 

The relation depends for its derivation on the fact that for a random 
or stratified sample with uniform sampling fraction the sampling variance 
is of the form $A(1/n - 1/N)$, where A is independent of n. It is therefore applicable 
to the case in which domains of study cut across strata, provided we admit the 
[...] Example 9.5. Consequently the relative efficiencies 
of the sampling methods of that example can be calculated directly from 
Table 9.5.b. 
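
The relation can be verified in a few lines. The sketch below assumes only the variance form stated above for the random or uniformly stratified method, $A(1/n - 1/N)$. The symbols $V_1$ (the variance of the other method with n units) and $n'$ (the size of the random or uniformly stratified sample that gives the same variance) are introduced here purely for illustration and do not appear in the text.

```latex
\begin{align*}
P &= \frac{V_1}{A\,(1/n - 1/N)}, \qquad f = \frac{n}{N}, \\[4pt]
A\left(\frac{1}{n'} - \frac{1}{N}\right) &= V_1 = P\,A\left(\frac{1}{n} - \frac{1}{N}\right), \\[4pt]
E = \frac{n}{n'} &= \frac{n}{N} + P\left(1 - \frac{n}{N}\right) = f + (1 - f)\,P .
\end{align*}
```

The second line equates the two variances; multiplying it through by $n/A$ gives the third.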


Example 10.12 

Calculate the relative efficiencies of the sampling methods of Example 9.5 
from the relative precisions given in Table 9.5.b. 

We have f = 614/3296 = 0.186. For total acreage of domain B, for 
example, the efficiency of a stratified sample with a uniform sampling fraction 
relative to the sample with variable sampling fraction is 0.186 + 0.814 × 0.65 
= 0.72. The precision of a random sample relative to the sample with variable 
sampling fraction is 0.65 × 0.71 = 0.46, and the corresponding relative 
efficiency is therefore 0.56. 
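
The arithmetic of this example can be reproduced with a short script. This is a minimal sketch: the function name `relative_efficiency` is an illustrative choice, and the inputs are the figures quoted in the example.

```python
def relative_efficiency(precision, sampling_fraction):
    """Relation of Section 10.12: E = f + (1 - f) * P."""
    return sampling_fraction + (1.0 - sampling_fraction) * precision


# Figures quoted in Example 10.12.
f = 614 / 3296

# Stratified sample with uniform sampling fraction, P = 0.65.
e_stratified = relative_efficiency(0.65, f)

# Random sample: relative precision is the product 0.65 * 0.71.
e_random = relative_efficiency(0.65 * 0.71, f)

print(round(f, 3), round(e_stratified, 2), round(e_random, 2))
```

The three printed values agree with the 0.186, 0.72 and 0.56 of the example.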


10.13 Detection of gross errors: use of ratios and regressions 


We have in Sections 5.6 and 5.20 referred to the importance of eliminating 
and controlling errors in the recording and statistical abstraction of survey 
data. Some further discussion of this important subject may, however, not 
be out of place. 

It is frequently not realised how difficult it is even for skilled observers and 
recorders to make measurements and to record facts without from time to 
time making errors. Unless far more elaborate precautions than are usually 
customary or possible have been taken, it can be confidently predicted that any 
large body of recorded measurements and observations will contain a percentage 
of errors. Even when this percentage is small the errors themselves are fre- 
quently of such a nature and magnitude that if uncorrected they seriously 
detract from the value of the results. 

Fortunately, the data themselves often permit statistical checks between 
the various measurements and observations. Qualitative characteristics can 
be examined for inconsistencies, such as that a male has borne children, and 
improbabilities, such as that a woman of 20 is credited with 5 children. Highly 
correlated quantitative characteristics provide a mutual check on gross errors 
in either variate. Sometimes an exact relation exists such that the sum of a 
number of variates is equal to a further observed variate. 

In large-scale surveys the possibility of utilising internal checks of this 
nature should always be explored. Such checks provide a control of the quality 
of the recording and abstraction. At the same time, if properly applied they 
can in many cases effectively eliminate most of the gross errors and thereby 
substantially improve the value of the results. The amount and type of checking 
required depends very much on the nature of the survey, but there is a general 
tendency to underestimate the amount that is worth while, often with disastrous 
results: as errors are gradually brought to light in the course of the analysis 
it slowly becomes apparent that more thorough checking and complete recom- 
putation is the only possible course. 

For quantitative characteristics ratios and regressions provide a valuable 
method of checking the original data and eliminating gross errors. When 
two variates are highly correlated, values which are in no way exceptional for 
either variate separately stand out as clearly abnormal when the two variates 
are considered in conjunction. The simplest method, if the data are not too 
numerous, is to plot one variate against the other. The data of aberrant points 
can then be examined. 
If the error is in the recording, this will not be revealed by the data themselves. 
Sometimes a repetition of the doubtful observations can be made, but 
this will not often be possible. Sometimes the error is an obvious numerical 
one, such as a round 10 or 100, which can be corrected. Often, however, 
there will be no way of determining the correct value. In such cases the 
measurement will either have to be rejected, or an estimated value substituted. 

Rejection of the measurement is the best and simplest course if this does 
not entail too much loss of information on other associated measurements, or 
throw the sample out of balance. Incomplete data are, however, frequently 
very troublesome to handle statistically, and consequently rejection of the 
measurement often necessitates the rejection of the unit as a whole. Substitution 
of an estimated value provides a simple method of preserving the remainder 
of the data on the unit in question. Usually an estimate can be derived from 
one or more of the regressions, determined graphically. 

Sometimes there are independent records which can be referred to when a 
recorded value is in doubt. In a recent anthropometric survey (Healy, 1952, 
D’), for example, photographs of children were taken at the same time as 
measurements were made. All the recorded measurements were checked in 
pairs, by plotting, the measurements which were most highly correlated being 
chosen as members of the pairs. Certain approximate summation checks were 
also possible. Aberrant values were checked against the original photographs. 
It was found that practically all the queried values were in fact in error. 

In this survey the recorded data were already punched on Hollerith cards. 
The plotting was carried out semi-mechanically, by sorting and making a card 
count of the numbers in each cell of the relevant two-way table, the actual cell 
numbers being then entered by hand on a two-way diagram. A method has 
also been devised for making a complete plot on a Hollerith tabulator (King, 
1949, B’). 

In extensive work all plotting can be dispensed with, once the ratios or 
regressions and the limits of error about them have been defined. The task 
is simplest with a ratio, since all that is necessary is to calculate the individual 
values of the ratio and examine those falling outside the prescribed limits, or 
to arrange for some mechanical sorting which effects this without actually 
calculating the individual values. The same methods apply in principle with a 
regression, but require greater elaboration. 

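The ratio check can be sketched in a few lines. This is an illustration only: the function name `flag_by_ratio`, the limits 0.4 and 0.6 and the four records are hypothetical and are not taken from any survey discussed here. In practice the limits would be set from a study of the material.

```python
def flag_by_ratio(pairs, lower, upper):
    """Return the positions of (x, y) records whose ratio y/x lies
    outside the prescribed limits [lower, upper]."""
    flagged = []
    for i, (x, y) in enumerate(pairs):
        # A zero denominator cannot give a ratio, so it is queried too.
        if x == 0 or not (lower <= y / x <= upper):
            flagged.append(i)
    return flagged


# Hypothetical records of two highly correlated measurements.
records = [(100.0, 52.0), (120.0, 61.0), (110.0, 550.0), (90.0, 44.0)]
suspect = flag_by_ratio(records, lower=0.4, upper=0.6)
print(suspect)
```

Only the third record is queried. Its second value looks like a misplaced decimal point, the kind of obvious numerical error that can be corrected on inspection.
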
Checks of this kind have been found, from bitter experience, to be necessary 
in all exact anthropometric work. Many of the errors that occur in this type 
of survey can be traced to incorrect scale readings [...]. 
Automatic recording devices, which would [...], 
clearly require to be developed for use in this and in other fields where 
sets of exact measurements have to be [...]. 

If no external checks, such as photographs, [...], 
the problem of rejection or correction of suspected errors becomes much [...]. 
[...] Many rules have been proposed for the rejection of observations 
from internal evidence, but they are of limited value, since they are based for 
the most part on the assumption that the underlying distribution is normal, 
an assumption which may be far from the truth. The best guides are probably 
knowledge of the material being handled, a study of the distributions involved, 
a reluctance to reject an excessive number of observations, and common sense. 
A further point to remember is that once rejection is resorted to, the estimates 
of variance become suspect, since it is always the extreme values that are 
rejected. On the other hand the existence of a few gross errors introduces 
an instability into the variances which may not exist in the actual material. 
For exact work, in fact, gross errors must be avoided at all costs. 

Another rejection problem of a rather different type arises when a unit 
known to be exceptional happens to occur in a sample. Thus in a recent 
agricultural survey of Jamaica a few small farms were obtained which were 
almost wholly devoted to coconuts, an exceptional crop for smallholders. 
With the variable sampling fraction used, the inclusion of these farms would 
have seriously distorted the district figures, and they were therefore omitted. 
A parallel example which occurred in the one per cent. sample of the British 
Census is described in Section 10.16. In such cases rejection is the simplest 
course to adopt, but such rejections should always be reported, and if necessary 
included in the national estimates derived from the sample. Essentially what 
is being done is to redefine the population with certain exceptional units 
excluded. Since the number of these units in the population is unknown there 
will be slight uncertainty as to the exact number of units in the remainder of 
the population, but this is not likely to introduce any appreciable errors. 


10.14 Lattice sampling 


If we have a square area of side p, divided into $p^2$ unit squares, we can select 
a sample of p unit squares in such a manner that every row and every column 
of the large square contains one of the selected unit squares. Such sampling 
is a special type of double stratification without control of sub-strata (Section 
3.4). The rows and columns of the square can represent any two-way classification 
of the material in which there are equal numbers of classes in the two 
classifications and one unit in each sub-class. 

Similar schemes are possible for three- or more way classifications. With 
a three-way classification, for example, with $p^3$ units, a sample of p units can 
be selected such that one unit is taken from each class of each classification. 
Alternatively a sample of $p^2$ units can be selected in such a manner that there 
are p units in every class of each classification, the p units belonging to any one 
class of any classification being so selected that one unit falls in each of the 
classes of the other two classifications. A sampling scheme of this last type 
will be defined by a Latin square of side p, that is, a square pattern of p letters 
in which each letter occurs once and once only in each row and each column. 
Table 10.14.a shows a 6 × 6 square. The rows, columns and letters of the 
square represent the three classifications. 

TABLE 10.14.a—EXAMPLE OF A 6 × 6 LATIN SQUARE 

C B E A F D 
F D A E C B 
D F C B A E 
E A D C B F 
B C F D E A 
A E B F D C 


Some element of randomization must, of course, be introduced in the 
selection of actual samples. If an estimate of error is required the type of 
randomization is governed by the form of the estimate, and is described below. 
If no estimate of error is required a sample of p units from a square can be 
selected by selecting a unit from the first row at random, then selecting a unit 
from the p — 1 unoccupied columns of the second row at random, and so on. 
The selection of a sample of p units from a cube is similar. For the first two 
classifications a sample of p units is taken from a square, as above, and p letters 
are then allocated to these points to indicate the third classification. In the 
case of a sample of $p^2$ units from a cube, the rows, columns and letters of any 
available p × p square can be randomized among themselves. The details of 
this procedure are explained below. Examples of squares up to 12 × 12 are 
given in Statistical Tables, but if these are not available a randomization of one 
of the diagonal squares (i.e. a square with each letter on lines parallel to a 
diagonal) will suffice. This may be obtained directly by allocating the letters 
in the first column at random, and then allocating those in the first row (except 
the first) at random. The further rows may then be filled in by writing the 
letters in the same order as in the first row, beginning with the determined letter. 
No further randomization is required. 
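
The two selection procedures just described can be sketched as follows. This is an illustrative reading of the text, not a prescribed implementation, and the function names are hypothetical. Drawing a random permutation of the columns is used as an equivalent of choosing, row by row, among the unoccupied columns.

```python
import random


def select_from_square(p, rng):
    """Sample of p units with one unit in every row and every column."""
    columns = list(range(p))
    rng.shuffle(columns)  # same effect as drawing row by row from unoccupied columns
    return [(row, columns[row]) for row in range(p)]


def randomized_diagonal_square(letters, rng):
    """Randomized diagonal Latin square built as described in the text."""
    first_column = list(letters)
    rng.shuffle(first_column)              # letters of the first column at random
    rest = [x for x in letters if x != first_column[0]]
    rng.shuffle(rest)                      # rest of the first row at random
    first_row = [first_column[0]] + rest
    square = []
    for lead in first_column:
        k = first_row.index(lead)
        # Same cyclic order as the first row, beginning with the determined letter.
        square.append(first_row[k:] + first_row[:k])
    return square


def is_latin(square):
    """True if every letter occurs once in each row and each column."""
    p = len(square)
    rows_ok = all(len(set(row)) == p for row in square)
    cols_ok = all(len({square[i][j] for i in range(p)}) == p for j in range(p))
    return rows_ok and cols_ok


rng = random.Random(1)
square = randomized_diagonal_square(list("ABCDEF"), rng)
for row in square:
    print(" ".join(row))
```

Each later row is a cyclic shift of the first row. Because the letters in the first column are all different, the shifts are all different, and the result is a Latin square.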

Sampling of this kind, which may be termed lattice sampling, is of 
particular use when the material to be sampled is of a type that lends itself to 
multiple subdivision on a square or cubic pattern. One such type arises in 
sampling schemes which extend over both space and time. An example is 
provided by a sampling scheme for estimating the catches of fish landed at 
various ports along the coast of India. Catches are there landed at all hours 
of the day and night, the times of landing depending on the tide, weather, etc. 
On any one part of the coast the times of landing at the different ports are highly 
correlated. It was therefore proposed that a sampling scheme be adopted in 
which every port would be sampled every day, the times of sampling being so chosen 
that for a group of p neighbouring ports a different part of the day was covered 
at each port. Moreover the times of day were to be so selected that over a period 
of p days all times of the day were covered at each port. This requires a 
Latin square in which the rows, columns and letters represent ports, days and 
times of day. A more complicated example involving two-stage sampling is 
provided by proposals for road-traffic censuses outlined in the next section. 

Various two-dimensional schemes of the lattice type were suggested by 
Tepping et al (1943, D) under the name of deep stratification. Their performance 
was tested out on housing data in an American city. The sampling unit was a 
city block, rent and size (number of housing units) being taken as the two types 
of subdivision. Each rent-size classification contained a number of blocks, 
one block being selected at random from each chosen cell. If the number of 
blocks is to be the same in all cells the rent classification must be varied within 
the different size subdivisions, or vice versa. If this is not done, the simplicity 
of the scheme is sacrificed. This lessens the effectiveness of schemes of this 
type for sociological or economic material, particularly if the two classifications 
are highly correlated. 

More complicated schemes, in which different probabilities of selection 
are assigned to different patterns, have recently been discussed by Goodman 
and Kish (1950, A’). A scheme of the same general type appears to be used 
in the Canadian Labour Force Survey (Keyfitz and Robinson, 1949, F’). 

The problem of estimating the error of lattice samples must now be con- 
sidered. A general paper on the subject by H. D. Patterson will shortly be 
published. Case (d) below is discussed by Hansen, Hurwitz and Madow 
in their forthcoming book. 

(a) Square lattice 


No valid estimate of error is possible for a sample containing p units. With 
a sample of 2p units, however, a valid estimate is possible. The simplest 
procedure is to divide the lattice into p mutually exclusive sets of p units. Any 
Latin square effects such a subdivision, the letters defining the sets. These 
sets may themselves be regarded as complex sampling units. If two sets are 
selected at random from all p sets the contrasts between the sets will therefore 
provide an estimate of the sampling error with one degree of freedom. This 
estimate of error will not be of much value if only one square is sampled. If 
there are a number of squares each square will contribute one degree of freedom 
to the estimate of error. 

When p is even, however, there is an alternative procedure by which a 
sample of 2p units can be made to yield $\tfrac{1}{2}p$ degrees of freedom for error. We 
start with a basic pattern made up of 2 × 2 squares, such as that shown for 
an 8 × 8 square on the left side of Table 10.14.b. We then rearrange first 
the rows in random order amongst themselves, and secondly the columns in 
random order amongst themselves. If the two orders are 21748365 and 
13524678 the arrangement on the right of the table will be obtained. 

The symbols may now be taken to indicate the values actually observed. 
An estimate of the error variance per unit with 4 degrees of freedom will then 
be provided by 

$s^2 = \tfrac{1}{16}\,[(a_1 - a_2 - a_3 + a_4)^2 + (b_1 - b_2 - b_3 + b_4)^2 
      + (c_1 - c_2 - c_3 + c_4)^2 + (d_1 - d_2 - d_3 + d_4)^2]$

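TABLE 10.14.b—8 × 8 ROTATIONAL SCHEMES FOR TWO TYPES OF SUBDIVISION 

           Basic pattern                         Actual arrangement 
     1   2   3   4   5   6   7   8           1   2   3   4   5   6   7   8 
 1   a1  a2  .   .   .   .   .   .       1   a3  .   .   a4  .   .   .   . 
 2   a3  a4  .   .   .   .   .   .       2   a1  .   .   a2  .   .   .   . 
 3   .   .   b1  b2  .   .   .   .       3   .   .   .   .   .   .   d1  d2 
 4   .   .   b3  b4  .   .   .   .       4   .   b3  .   .   b4  .   .   . 
 5   .   .   .   .   c1  c2  .   .       5   .   .   .   .   .   .   d3  d4 
 6   .   .   .   .   c3  c4  .   .       6   .   b1  .   .   b2  .   .   . 
 7   .   .   .   .   .   .   d1  d2      7   .   .   c3  .   .   c4  .   . 
 8   .   .   .   .   .   .   d3  d4      8   .   .   c1  .   .   c2  .   . 

(A dot denotes a unit not included in the sample.) 
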


In the general case the divisor will be 2p. Allowing for finite sampling, the 
sampling variance of the mean will be $(1 - 2/p)\,s^2/2p$. With the specified 
randomization process it can be shown that this estimate of error is unbiased. 
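
The calculation can be sketched as follows. The function name and the observed values are hypothetical. Each 2 × 2 square of the basic pattern is supplied as a tuple in the order of its suffixes 1 to 4, the divisor is taken as 2p, and the finite-sampling factor is (1 − 2/p), as stated above.

```python
def square_lattice_error(blocks, p):
    """Error variance per unit and sampling variance of the mean for a
    sample of 2p units arranged in p/2 squares of 2 x 2.

    blocks: one tuple (v1, v2, v3, v4) per 2 x 2 square of the basic pattern.
    """
    if p % 2 != 0 or len(blocks) != p // 2:
        raise ValueError("p must be even, with p/2 blocks of four values")
    # One contrast v1 - v2 - v3 + v4 per block, i.e. one degree of freedom each.
    sum_sq = sum((v1 - v2 - v3 + v4) ** 2 for v1, v2, v3, v4 in blocks)
    s2 = sum_sq / (2 * p)
    variance_of_mean = (1 - 2 / p) * s2 / (2 * p)
    return s2, variance_of_mean


# Hypothetical observations for the a, b, c and d squares of an 8 x 8 lattice.
blocks = [(3, 1, 2, 4), (5, 5, 5, 5), (1, 2, 3, 4), (6, 2, 2, 2)]
s2, var_mean = square_lattice_error(blocks, p=8)
print(s2, var_mean)
```

For these figures the four contrasts are 4, 0, 0 and 4, so that the sum of squares is 32 and the variance per unit is 32/16 = 2.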


(b) Cubic lattice, sample of 2p units 

In the case of a cubic lattice no unbiased estimate of error of the above 
type appears to be possible with a sample of 2p units. The best that can be 
done is to obtain one degree of freedom by selecting and contrasting two out 
of $p^2$ mutually exclusive sets of p units. Such a group of sets may be constructed 
as follows. Let $(a_1, a_2, \ldots, a_p)$, $(b_1, b_2, \ldots, b_p)$, $(c_1, c_2, \ldots, c_p)$ be three 
random permutations of the numbers 1 to p. Also let x and y be two random 
numbers, the same or different, but not both 1, between 1 and p. Then the 
lattice co-ordinates of the two sets are 

$(a_1, b_1, c_1),\ (a_2, b_2, c_2),\ \ldots,\ (a_p, b_p, c_p)$ 
$(a_1, b_x, c_y),\ (a_2, b_{x+1}, c_{y+1}),\ \ldots,\ (a_p, b_{x+p-1}, c_{y+p-1})$ 

with the proviso that when any suffix is greater than p, it is reduced by p. For 
example, if for a $6^3$ lattice the random permutations are (5 6 1 2 4 3), 
(2 4 1 3 6 5), (4 3 6 2 5 1) and x, y are 6, 4, the two sets are 

(5, 2, 4), (6, 4, 3), (1, 1, 6), (2, 3, 2), (4, 6, 5), (3, 5, 1), 
(5, 5, 2), (6, 2, 5), (1, 4, 1), (2, 1, 4), (4, 3, 3), (3, 6, 6). 
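
The construction of the two sets can be sketched in code and checked against the worked example. The function name `cubic_lattice_sets` is an illustrative choice. The suffix rule "reduce by p when greater than p" is implemented with the remainder operator on zero-based positions.

```python
def cubic_lattice_sets(a, b, c, x, y):
    """Two mutually exclusive sets of p units from a p x p x p lattice.

    a, b, c: permutations of 1..p; x, y: numbers between 1 and p, not both 1.
    """
    p = len(a)
    if x == 1 and y == 1:
        raise ValueError("x and y must not both be 1 (the sets would coincide)")
    first = [(a[i], b[i], c[i]) for i in range(p)]
    # Suffixes of b and c are advanced by x - 1 and y - 1, wrapping round at p.
    second = [(a[i], b[(i + x - 1) % p], c[(i + y - 1) % p]) for i in range(p)]
    return first, second


# The permutations and random numbers of the example in the text.
first, second = cubic_lattice_sets(
    a=(5, 6, 1, 2, 4, 3), b=(2, 4, 1, 3, 6, 5), c=(4, 3, 6, 2, 5, 1), x=6, y=4
)
print(first)
print(second)
```

The output reproduces the two sets of co-ordinates given above.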

With a sample of 4p units an estimate similar to that of the square lattice 
sample is possible when p is even, using a basic pattern of $\tfrac{1}{2}p$ 2 × 2 × 2 cubes. 
Unfortunately, however, the different variance components enter in different 
[...]. Investigation shows that the unbiased estimate 
involves [...] the mean squares corresponding to the two-factor and 
three-factor interactions of the 2 × 2 × 2 cubes. It will rarely be more accurate 
than the estimate based on the three degrees of freedom obtained by taking 
four sets of p units at random from $p^2$ mutually exclusive sets. 


(c) Cubic lattice, sample of $2p^2$ units 

When p is even, an unbiased estimate of error with $\tfrac{1}{4}p^2$ degrees of freedom, 
similar to that for the square lattice, can be obtained. The basic pattern is 
made up of 2 × 2 × 2 cubes. It can be represented by a Latin square made 
up of 2 × 2 squares, and a second Latin square obtained by reversing all the 
2 × 2 squares of the first square. The letters then represent the third co-ordinate. 
Table 10.14.c shows a suitable pair of squares for p = 6. Larger 
squares can be constructed in a similar manner. 


TABLE 10.14.c—BASIC PATTERN FOR A SAMPLE OF $2p^2$ UNITS FROM A CUBIC LATTICE 

A B C D E F        B A D C F E 
B A D C F E        A B C D E F 
C D E F A B        D C F E B A 
D C F E B A        C D E F A B 
E F A B C D        F E B A D C 
F E B A D C        E F A B C D 


As before, the rows, columns, and letters must be randomized, the same randomization 
being used for each square. The randomization of letters is effected 
in the same manner as that for rows and columns, writing the letters A-F 
in random order, say B F A C D E, and substituting B for A, F for B, A for 
C, etc. 

For the estimate of error we must calculate for each of the original 2 x 2 x 2 
cubes the difference between the sum of the four units of the first square and 
the sum of the four units of the second square belonging to that cube. These 
differences can be represented geometrically by assigning opposite signs to the 
points at the two ends of each edge of the relevant cube. The components 
of the differences can easily be picked out, since (with the letter randomization 
adopted) B goes with F, A with C, and D with E, and the components of each 
cube occur at the intersection of two rows and two columns which are the same 
for each square. 

There are $\tfrac{1}{4}p^2$ such differences. If these are denoted by d we have for the 
variance of a single unit 

$s^2 = S(d^2)/2p^2$

The sampling variance of the mean of the $2p^2$ units is therefore $(1 - 2/p)\,s^2/2p^2$. 
(d) Estimation of error from complete data 


If we have available data on all units of one or more lattices, we can without 
difficulty estimate what the sampling error of lattice sampling would have 
been. Such information is necessary in the planning of surveys both for 
determination of size of sample and for the study of the relative efficiency of 
lattice sampling and other methods. 

The procedure of the analysis of variance can be applied to the complete 
data of a square lattice in the manner of Table 10.14.d. 


TABLE 10.14.d—ANALYSIS OF VARIANCE FOR A COMPLETE SQUARE LATTICE 

                         Degrees of freedom 
Rows (R)                 $p - 1$ 
Columns (C)              $p - 1$ 
Remainder (R × C)        $(p - 1)^2$ 
Total                    $p^2 - 1$ 


The sum of squares for rows is given by the sum of the squares of the 
deviations of the row totals, divided by p, and similarly for the columns. 
The sum of squares for the remainder (known as the two-factor interaction 
R × C) is obtained by subtraction. The remainder mean square then gives 
an estimate $s^2$ of the variance per unit in lattice sampling. If we require to 
determine the relative precision of lattice sampling and simple stratification by 
rows, we calculate a new mean square for columns plus remainder, adding 
the degrees of freedom and the sums of squares. Similarly, rows plus remainder 
gives an estimate of the variance for stratification by columns, while the total 
line gives the estimate for random sampling. 
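
The analysis of a complete square lattice can be sketched as follows. The function name, the dictionary keys and the small 3 × 3 table are hypothetical. The sums of squares are computed from totals with the usual correction for the mean, which is equivalent to working with deviations.

```python
def square_lattice_anova(table):
    """Variance-per-unit estimates from complete data on a p x p lattice."""
    p = len(table)
    n = p * p
    grand_total = sum(sum(row) for row in table)
    correction = grand_total ** 2 / n
    total_ss = sum(v * v for row in table for v in row) - correction
    row_ss = sum(sum(row) ** 2 for row in table) / p - correction
    col_totals = [sum(table[i][j] for i in range(p)) for j in range(p)]
    col_ss = sum(t ** 2 for t in col_totals) / p - correction
    remainder_ss = total_ss - row_ss - col_ss      # R x C, by subtraction
    df_main, df_rem = p - 1, (p - 1) ** 2
    return {
        "lattice": remainder_ss / df_rem,
        "stratified_by_rows": (col_ss + remainder_ss) / (df_main + df_rem),
        "stratified_by_columns": (row_ss + remainder_ss) / (df_main + df_rem),
        "random": total_ss / (n - 1),
    }


# Hypothetical, purely additive data: the remainder mean square is zero.
table = [[0, 1, 2], [1, 2, 3], [2, 3, 4]]
estimates = square_lattice_anova(table)
print(estimates)
```

With purely additive data the lattice sample removes all the variation, whereas stratification by rows alone leaves the column component in the error.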

The procedure in the case of a $p^3$ lattice is similar. Denote the three 
classifications by R, C and L. Summation over L gives a p × p table of totals 
which can be analysed in the manner of Table 10.14.d to give R, C, and 
R × C. The sums of squares must be divided by an additional factor p because 
we are working with totals of p units. Tables for R and L and C and L can 
be constructed similarly. These can then be combined into a single table 
in the manner of Table 10.14.e. The three-factor interaction R × C × L 
is then obtained by subtraction. 

TABLE 10.14.e—ANALYSIS OF VARIANCE FOR A COMPLETE CUBIC LATTICE 

                 Degrees of       Mean 
                 freedom          square 
R                $p - 1$ 
C                $p - 1$ 
L                $p - 1$ 
R × C            $(p - 1)^2$      A 
R × L            $(p - 1)^2$      B 
C × L            $(p - 1)^2$      C 
R × C × L        $(p - 1)^3$      D 
Total            $p^3 - 1$ 

The estimate of error variance for a lattice sample of $p^2$ units (based on a 
Latin square) is given by the mean square D for R × C × L. The estimate 
of error for a lattice sample of p units is given by the expression 

$s^2 = \{A + B + C + (p - 2)D\}/(p + 1)$

If p is large this tends to the pooled mean square for all four interactions. 
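
The behaviour for large p can be illustrated numerically. The sketch below takes the expression for a sample of p units to be {A + B + C + (p − 2)D}/(p + 1), as read from the text. The pooled mean square is computed on the assumption that the two-factor interactions carry (p − 1)² degrees of freedom each and the three-factor interaction (p − 1)³. The mean squares used are hypothetical.

```python
def lattice_variance_p_units(A, B, C, D, p):
    """Error variance per unit for a lattice sample of p units from a cube."""
    return (A + B + C + (p - 2) * D) / (p + 1)


def pooled_interaction_mean_square(A, B, C, D, p):
    """Mean square pooled over the four interactions, weighted by their
    degrees of freedom (assumed (p-1)^2 each for A, B, C and (p-1)^3 for D)."""
    df2, df3 = (p - 1) ** 2, (p - 1) ** 3
    return (df2 * (A + B + C) + df3 * D) / (3 * df2 + df3)


# Hypothetical mean squares.
A, B, C, D = 6.0, 5.0, 4.0, 2.0
for p in (4, 12, 100):
    print(p, lattice_variance_p_units(A, B, C, D, p),
          pooled_interaction_mean_square(A, B, C, D, p))
```

The pooled value simplifies to {A + B + C + (p − 1)D}/(p + 2), so the two expressions differ only in terms that vanish as p increases.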


(e) Multi-stage schemes 

When each cell of the lattice contains a number of second-stage units, of 
which some only are selected, the estimation of error follows the ordinary lines 
for multi-stage schemes. If an estimate of the second-stage error is not required 
the selection of one second-stage unit from each first-stage unit will normally 
provide an adequate estimate of the total sampling error, since the first-stage 
sampling fraction is not usually large. 

Another type of two-stage sampling arises when the members of one of the 
lattice classifications are themselves a sample from a larger number of such 
classes. An example of this type of sampling is described in the next section 
where the method of estimating the error is explained. 


10.15 Censuses of road traffic 

Statistics for road traffic, such as total vehicle-miles, passenger-miles, 
and ton-miles, can be obtained either from returns made by vehicle operators 
or from counts and other observations of vehicles passing selected points of the 
road network. Statistics of tons loaded and length of journey can only be simply 
obtained from returns by vehicle operators. The sampling problems of the 
former type of census are straightforward, and need not be further discussed 
here, but those arising in road traffic counts present a number of special features 
which are of general interest. 

In addition to their use in estimating the volume of road traffic, traffic 
counts are of value in road planning. For this purpose counting points can 
best be located at strategic points in the road network, so as to estimate the 
volume of traffic on particular stretches of road. The counts themselves may 
be confined to a particular week chosen as typical, or possibly to two or three 
such weeks at different times of the year. Nor is there any need to cover 
periods of the day for which the traffic is known to be light. Classification into 
types of vehicle will frequently be required, but not information on loads 
carried. Consequently, unless information on the origin and destination is 
required it will not be necessary to stop vehicles. If automatic counting 
devices are installed the whole period covered may be sampled for purposes 
of classification of vehicles. 

For the purpose of estimating the total volume of road traffic a different 
procedure is necessary. Unbiased estimates of quantities such as passenger- 
miles, vehicle-miles and ton-miles, cannot be obtained if the counting points are 
located at strategic points in the network. Instead some method of random 
or systematic location is necessary.* 

If points are located at random on the roads of a network with a density of 
1 per k miles and from counts made at these points it is found that $n_1, n_2, \ldots$ 
vehicles pass the points in a given period, then an unbiased estimate of the 
total vehicle-miles in the period is given by 

vehicle-miles $= k(n_1 + n_2 + \ldots) = k\,S(n)$

Similarly an unbiased estimate of the ton-miles is k S(W), where W is the 
total of the loads of all the vehicles passing a given point in the given period. 
It will be necessary to stop at least a sample of the vehicles passing the chosen 
points to ascertain the loads carried. 
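
As a small numerical illustration, the two estimates can be computed as follows. The density of one point per 5 miles, the counts and the load totals are hypothetical figures, and the function name is an illustrative choice.

```python
def raised_total(k, values):
    """Estimate k * S(values) for points located at a density of 1 per k miles."""
    return k * sum(values)


k = 5.0                                # one counting point per 5 miles of road
counts = [120, 80, 200]                # vehicles passing each point in the period
loads = [310.0, 150.0, 640.0]          # total tons carried past each point

vehicle_miles = raised_total(k, counts)
ton_miles = raised_total(k, loads)
print(vehicle_miles, ton_miles)
```

Each point in effect represents k miles of road, which is why the same raising factor applies to counts and to loads.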

In order to locate points at random on the network a running total of the 
lengths of all the roads and sections of road comprising the network may be 
made in such a manner that each piece of road is included once and once only. 
If the total length is L miles, any number l less than L then defines a unique 
point on the road network. If j numbers are selected at random between 1 
and L these numbers will define points on the road network which will be located 
at random, each mile point having an equal chance of being selected. To give 
a density of one point per k miles, we take j = L/k. 

Instead of locating the points at random they may be located systematically, 
i.e. at equal intervals with regard to the running total, using a random starting 
point h between 1 and k and selecting the points h, h + k, h + 2k, .... 
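
The running-total device can be sketched as follows. The road lengths and function names are hypothetical. As a simplifying assumption, the random start is drawn as a continuous value between 0 and k rather than as a whole number of miles between 1 and k.

```python
import random
from bisect import bisect_left
from itertools import accumulate


def locate(position, lengths):
    """Convert a position on the running total into
    (index of road section, distance along that section)."""
    totals = list(accumulate(lengths))
    if not 0 <= position <= totals[-1]:
        raise ValueError("position lies outside the network")
    i = bisect_left(totals, position)
    start = totals[i - 1] if i else 0
    return i, position - start


def systematic_positions(total_length, k, rng):
    """Points h, h + k, h + 2k, ... with a random start h."""
    h = rng.random() * k
    positions = []
    while h < total_length:
        positions.append(h)
        h += k
    return positions


lengths = [10.0, 5.0, 20.0]            # hypothetical road sections, in miles
rng = random.Random(0)
points = systematic_positions(sum(lengths), k=10.0, rng=rng)
print([locate(x, lengths) for x in points])
```

With separate running totals for each stratum of roads, the same two functions would serve for a variable sampling fraction, using a different k in each stratum.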

In practice it will be advisable to use a variable sampling fraction, with a 
considerably higher density of points on the more important roads. For this 
purpose the roads must be stratified according to their importance. When this 
stratification has been made, separate running totals can be constructed for 
the different strata. If any large area is to be covered it will also obviously be 
advisable to divide the area into regions and treat each region separately. 

If desired, certain types of road such as roads within city boundaries, and 
minor roads in built-up areas, can be excluded entirely. This will automatically 
exclude the traffic on these roads from the estimates. 

Sampling may be used in various other ways to increase the accuracy of 
the results with a given expenditure of effort. Lattice sampling is particularly 
useful. Instead of carrying out continuous counts, for example, counts may 
be made at different hours of the day at different points, a rotation being arranged 
so that all periods of the day are equally covered, and that different periods are 
covered at the same point on different days. Thus the counts on a group 
of 12 points can be so arranged that on 12 consecutive days each 2-hour period 
is counted on one and one only of the 12 days, and that on each day every 
2-hour period is covered at some point of the group. Since there is likely to be 
a considerable amount of variation between points, a rotation of this type may 
be expected to increase the accuracy considerably, since a much larger number 
of points can be included for the same amount of counting. 

* The proposals which follow are based on a note prepared for a Working Party 
of the Inland Transport Committee of the U.N. Economic Commission for Europe. 

There will of course be little advantage in using this type of sampling for 
counts if automatic counting devices are available, but the procedure will still 
be of value for sampling to ascertain loads, etc. The number of vehicles that 
have to be stopped may also be reduced by examining only a fraction of the 
vehicles passing during the time a point is under observation. Care must be 
taken to see that bias is not introduced by this procedure. If every third vehicle 
is examined, for example, there must be no element of choice, i.e. the count 
must be based on the order of arrival. If the fraction is varied from time to 
time owing to variation in traffic density each such variation must be noted so 
that the correct raising factor can be used for each part of the data. 

It is essential for objective estimates that the whole of the day and night is 
covered. If the volume of traffic is substantially reduced during the night, 
however, it may be advantageous to use a variable sampling fraction here also, 
covering the night period less intensively than the day period. Similarly if an 
objective estimate of the total annual volume of traffic is required the different 
parts of the year must be properly sampled. 

The calculation of the sampling error is straightforward except for rotational 
schemes. Since the observation points themselves constitute a random selection 
from all possible points, a single Latin square arrangement of observations on 
12 points extending over 12 days with 12 periods in each day forms a second- 
stage sample of one unit out of the 12 units defined by any set of 12 squares 
which together comprise the whole of the traffic passing these points. The 
difference between the totals for a pair of Latin squares from the set will there- 
fore only give an estimate of the sampling error at the second stage. To obtain 
an estimate of the total sampling error, two different sets of 12 points must be 
taken for the two Latin squares. The square for each set should be inde- 
pendently randomized, but the points of each set can be obtained (with some 
gain in precision) by random selection of two points from each of 12 strata 
instead of 24 points from a single stratum. Thus in the simple case of the 
estimation of traffic along a single main road the road can be divided into 12 
equal sections. Two points are then located at random in each section, one 
of each pair of points being allocated at random to the first square, and the other 
to the second square. Each pair of squares yields only one degree of freedom. 
This limitation, however, is not of great importance in schemes which are 
extensive either in space or time, since many pairs of squares will be required 
for the whole scheme. 
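
The way in which pairs of squares contribute to the error estimate can be sketched as follows. This is an illustrative sketch under stated assumptions, not the book's own computing routine: it assumes the two squares of a pair are based on independently located sets of points with equal expectation, ignores finite-population corrections and raising factors, and uses hypothetical totals. Under those assumptions the squared difference of the two totals of a pair estimates the variance of their sum, with one degree of freedom per pair.

```python
import math


def total_and_standard_error(pairs):
    """pairs: list of (t1, t2), the totals of the two squares of each pair.

    Returns the sum of all square totals and an estimate of its standard
    error, built from one squared difference per pair.
    """
    total = sum(t1 + t2 for t1, t2 in pairs)
    variance = sum((t1 - t2) ** 2 for t1, t2 in pairs)
    return total, math.sqrt(variance)


# Hypothetical totals for two pairs of Latin squares.
pairs = [(1200, 1260), (980, 940)]
total, se = total_and_standard_error(pairs)
print(total, round(se, 1))
```

With only one pair the estimate rests on a single degree of freedom, which is why the text stresses that extensive schemes, with many pairs, are needed for a usable estimate of error.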

10.16 The use of sampling to speed up analysis: the 1951 Census of 
Great Britain 

The value of sampling in reducing the volume of numerical and machine
work and speeding up the analysis of a complete census has already been
stressed in Section 5.21. An interesting example of this use of sampling is
provided by the 1 per cent. sample of the 1951 Census of Great Britain (General 
Register Office, 1952, C’). This sample was taken with the object of providing 
preliminary results within a year of the census. 

The general procedure of this census was as follows. Large institutions and 
analogous establishments likely to contain 100 or more persons each were 
first identified and listed by the local census officers. These institutions (termed 
special enumeration districts, or S.E.D.s, with one institution to each district)
were enumerated on special institution schedules. The remaining habitations 
of the area were divided into districts termed ordinary enumeration districts 
(O.E.D.s) of such a size that each district could be dealt with by a single 
enumerator. The O.E.D.s were given identification numbers ranging conse- 
cutively from 1 onwards within each local census area. All census officers 
delineated their E.D.s on large-scale maps, and for the purpose of ready reference 
numbered them adjacently as far as possible. The general effect of this is that 
odd and even E.D.s tend to be contiguous. In all there were 49,318 O.E.D.s 
in England and Wales with an average content of about 270 households, or 
860 persons, and 9,730 O.E.D.s in Scotland with an average content of 150 
households, or 510 persons. There was considerable variation in the numbers 
of households and persons in the different O.E.D.s owing to variation in local 
conditions. Each habitation of each O.E.D. was listed systematically prior to 
the actual census by complete traverse of the district by the enumerator con- 
cerned, and the census schedules were subsequently numbered in the list order. 
Each schedule covered one household. 

For the sample from the O.E.D.s each local enumerator was instructed to 
furnish copies of the schedules of all households bearing numbers ending in 25 
if his O.E.D. number was odd, and ending in 76 if his O.E.D. number was 
even. For the sample from the S.E.D.s each local census officer was instructed 
to number the individuals of each S.E.D. schedule consecutively from 1 onwards 
and to furnish copies of the entries bearing numbers ending in 25 for
odd-numbered S.E.D.s and 76 for even-numbered S.E.D.s.
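
The selection rule just described can be expressed as a short sketch. The function name and the district sizes used below are hypothetical; the rule itself (end digits 25 for odd-numbered districts, 76 for even-numbered districts) is the one stated in the text.

```python
def selected_numbers(district_number, n_units):
    """Serial numbers (1..n_units) whose last two digits are 25 for an
    odd-numbered district and 76 for an even-numbered district."""
    ending = 25 if district_number % 2 == 1 else 76
    return list(range(ending, n_units + 1, 100))


print(selected_numbers(1, 270))  # [25, 125, 225]
print(selected_numbers(2, 270))  # [76, 176]
```

The two hypothetical districts of 270 households yield 3 and 2 sample households respectively, which illustrates how the sample size in a district depends on the end digits allotted to it.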

Apart from the disturbance arising from the use of only two pairs of end 
digits this procedure if correctly carried out should yield an almost exact 
1 per cent. sample of households in the O.E.D.s and of individuals in the 
S.E.D.s. From the systematic method of selection adopted a high degree of 
stratification within O.E.D.s and S.E.D.s is obtained. The sample will not, however,
contain exactly 1 per cent. of the population owing to variation in the size
of the O.E.D. households, which can range from 1 to approximately 100. 

A test of the sample as drawn was made by comparing it with the preliminary 
count of the full census. The result of this comparison for the whole country 
is shown in Table 10.16. The agreement on number of households is satis- 
factory for England and Wales, but there is some excess in the sample for 
Scotland. This disagreement is in fact of no consequence in view of the further 
adjustment (described below) which was made for each area which was to figure 
in the tabulation. The object of this adjustment was to bring not only the
number of households but also the total population into agreement with the
preliminary count. This ensured that the published sample totals would
agree with other published totals which might later be derived from the
tabulation of the full census material. The adjustment has, of course, the further
effect of somewhat increasing the accuracy of the sample, though at the expense
of slight distortion in certain respects.

TABLE 10.16. 1951 CENSUS OF GREAT BRITAIN: COMPARISON OF THE 1 PER
CENT. SAMPLE AND THE PRELIMINARY COUNT

                                       1/100th of      Excess or defect
                                       full census        of sample
                             Sample    (preliminary    Amount   Per cent.
                                         count)
Great Britain
  Total population          488,395      488,411        - 16     - 0.00
  Households in O.E.D.s     146,628      146,539        + 89     + 0.06
England and Wales
  Total population          437,158      437,450        - 292    - 0.07
  Households in O.E.D.s     131,973      131,998        - 25     - 0.02
Scotland
  Total population           51,237       50,961        + 276    + 0.54
  Households in O.E.D.s      14,655       14,541        + 114    + 0.78

It is, however, instructive to examine in a little more detail the probable
causes of the disagreement in numbers of households. If a systematic sample
of every hundredth object with random starting point j between 1 and 100
is taken from a number 100h + k of consecutively numbered objects, where
h and k are integers and 0 ≤ k ≤ 99, the number in the sample will be h or
h + 1 according as j > k or j ≤ k. The mean number over a large number of
samples will be h + k/100. The sampling variance of the number in the sample
is given by the mean square deviation from the mean number, and is found to be

$$\frac{k}{100}\left(1 - \frac{k}{100}\right).$$

If we have a large number of sets containing numbers of objects which differ
in such a manner that the values of k can be taken to be evenly distributed over
the range 0-99 the average sampling variance per set will be

$$\frac{1}{100}\sum_{k=0}^{99}\frac{k}{100}\left(1 - \frac{k}{100}\right),$$

which equals 0.16665 or very nearly 1/6.
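
The variance for a given k and its average over k = 0, 1, ..., 99 can be verified by direct enumeration of the 100 equally likely starting points. The sketch below is illustrative; the function names are hypothetical, and exact fractions are used so that no rounding enters the result.

```python
from fractions import Fraction


def sample_size(h, k, j):
    """Number selected from 100h + k objects when every hundredth object
    is taken, starting at j (1 <= j <= 100)."""
    return h + (1 if j <= k else 0)


def variance_for_k(k, h=0):
    """Mean square deviation of the sample size over the 100 possible starts."""
    sizes = [sample_size(h, k, j) for j in range(1, 101)]
    mean = Fraction(sum(sizes), 100)
    return sum((s - mean) ** 2 for s in sizes) / 100


average = sum(variance_for_k(k) for k in range(100)) / 100
print(float(average))  # 0.16665
```

The value of h does not affect the variance, since it adds the same constant to every possible sample size.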


Instead of taking each of the j's independently at random we may select
them in complementary pairs subject to the condition that j + j' = 101,
assigning each such pair to a pair of sets. Thus if the first of a pair of sets is
allocated a number 64 chosen at random, the second will be allocated the
number 37. By a calculation similar to that given above we then find that the
average sampling variance per set, when the values of k and k' can be regarded as
independently and evenly distributed over the range 0-99, is equal to 0.083325,
or very nearly 1/12. This is half the variance obtained when each value of j
is chosen independently.* The reason for the reduction is clear if we remember
that a run of low values of j will lead to an overestimate, and a run of high values
to an underestimate. The reduction is due to the balancing of high and low
values. It does not depend on there being any particular similarity between
the pairs of sets. If the values of k are highly correlated within pairs the
variance will be further reduced.
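
The halving of the variance under complementary starting points can be checked by enumeration in the same way. The sketch below is illustrative, with hypothetical function names; it treats the variance "per set" as half the variance of the combined sample size of a pair of sets, averaged over all combinations of k and k'.

```python
from fractions import Fraction


def extra(j, k):
    """1 if start j picks up an extra unit from the k odd objects, else 0."""
    return 1 if j <= k else 0


def pair_variance(k1, k2):
    """Variance of the combined sample size of two sets whose starting
    points are j and 101 - j, with j equally likely to be 1..100."""
    totals = [extra(j, k1) + extra(101 - j, k2) for j in range(1, 101)]
    s1 = sum(totals)
    s2 = sum(t * t for t in totals)
    return Fraction(s2, 100) - Fraction(s1, 100) ** 2


# Average over k1 and k2 independently and evenly distributed over 0..99,
# then divide by 2 to express the result per set.
per_set = sum(pair_variance(k1, k2)
              for k1 in range(100) for k2 in range(100)) / (2 * 100 * 100)
print(float(per_set))  # 0.083325
```

The reduction arises because a low j in one set is always matched by a high j in the other, so the two departures from expectation tend to cancel.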

If the same pair of complementary values is always taken, as in the British 
Census, the distribution of k has to be considered. Analysis shows that with an 
even distribution of k the mean error (or bias) is zero, and the average sampling 
variance is 0.083325 as above. But with any other distribution of k some bias
will be introduced. 

Since there are 49,316 O.E.D.s in England and Wales the standard error 
of the number of households in the sample due to sampling of the type adopted 
will be √(0.083325 × 49,316) = ± 64. The similar error for Scotland is
√(0.083325 × 9,730) = ± 28. The discrepancy for England and Wales is
therefore well within the sampling standard error, but that for Scotland is 4.1
times its standard error. 
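
The arithmetic of these standard errors can be reproduced as follows. The sketch is illustrative; the variable names are hypothetical, and the figures are those quoted in the text and in Table 10.16. The ratio for Scotland is formed with the standard error rounded to a whole number of households.

```python
import math

VARIANCE_PER_OED = 0.083325  # average sampling variance per set, from the text


def standard_error(n_oeds):
    """Standard error of the number of sample households over n_oeds districts."""
    return math.sqrt(VARIANCE_PER_OED * n_oeds)


se_england_wales = standard_error(49_316)
se_scotland = standard_error(9_730)
print(round(se_england_wales), round(se_scotland))  # 64 28

# Discrepancy for Scotland (+114 households) in units of its standard error.
print(round(114 / round(se_scotland), 1))  # 4.1
```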

As is pointed out in the report, the discrepancy in Scotland can be accounted
for by the fact that the selection number 25 is always associated with the
odd-numbered O.E.D.s and the selection number 76 with the even-numbered
O.E.D.s. As the O.E.D.s were numbered from 1 upwards in each census area,
the excess of odd-numbered enumeration districts will be approximately
one-half the number of census areas. There were 1,225 census areas in England
and Wales and 1,026 in Scotland. The bias through consistently taking the
selection number 25 will be approximately + 1/4 per O.E.D. Therefore the
bias introduced will be 1/4 × excess of odds, i.e. + 153 for England and
Wales and + 128 for Scotland. The discrepancy in the number of households
for Scotland shown in Table 10.16 therefore appears to be fully accounted
for by this bias. In the case of England and Wales, however, allowance for the
bias will increase the discrepancy to - 178, i.e. nearly three times its sampling
standard error. This discrepancy may have arisen from the form of the
distribution of k. The bias to be expected from this cause, however, can only be
calculated from the distribution of the numbers of households in the different
O.E.D.s.

From the published description it would appear that there would in fact
have been little difficulty in using all pairs of complementary numbers instead
of the pair 25 and 76. This would have eliminated the biases discussed above,
and would have been essential if a preliminary count had not been available.


*I am indebted to the General Regi 
reduction in variance resulting from the us 
A suitable method would have been to write down all pairs of complementary
numbers 1-100 in random order, and also randomize the order within the pairs. 
The numbers of the sequence could then be allotted in turn to the O.E.D.s 
in whatever systematic order they presented themselves. In order to avoid the 
need for fresh randomization the same sequence could be used repeatedly. 
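
Both the bias of a fixed starting number and the suggested remedy can be sketched in a few lines. This is an illustrative sketch, not the procedure of the General Register Office: the function names are hypothetical, k is assumed to be evenly distributed over 0-99, and Python's standard random module stands in for a table of random numbers.

```python
import random
from fractions import Fraction


def fixed_start_bias(j):
    """Expected excess in sample size per district when the start is always j,
    with k evenly distributed over 0..99."""
    p_extra = Fraction(sum(1 for k in range(100) if j <= k), 100)
    mean_extra = Fraction(sum(range(100)), 100 * 100)
    return p_extra - mean_extra


print(float(fixed_start_bias(25)), float(fixed_start_bias(76)))  # 0.255 -0.255
# Bias of about 1/4 per surplus odd-numbered district:
print(round(0.25 * 1225 / 2), round(0.25 * 1026 / 2))  # 153 128


def complementary_sequence(rng):
    """All 50 complementary pairs (j, 101 - j) in random order, the order
    within each pair also randomized, flattened to a list of 100 starts."""
    pairs = [(j, 101 - j) for j in range(1, 51)]
    rng.shuffle(pairs)
    sequence = []
    for pair in pairs:
        pair = list(pair)
        rng.shuffle(pair)
        sequence.extend(pair)
    return sequence


def allot_starts(n_districts, sequence):
    """Allot starts to districts in turn, reusing the sequence cyclically."""
    return [sequence[i % len(sequence)] for i in range(n_districts)]


starts = allot_starts(270, complementary_sequence(random.Random(0)))
```

Because every number from 1 to 100 appears once in each cycle of the sequence, no starting number is consistently tied to odd-numbered or even-numbered districts.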

Apart from its effect on the numbers of households selected the use of only 
one complementary pair introduces the danger of a bias of a different type. 
If the numbering of the households in the O.E.D.s is such that the households 
of a certain type tend to be allotted numbers less than 25 this type of household 
will be under-represented in the sample. The method of numbering households 
in the British Census was such that little, if any, bias is likely to have arisen 
from this cause, but in other censuses where the situation is different serious 
bias might well be introduced. 

As already mentioned, an adjustment was made to attain simultaneous
agreement with the preliminary count for numbers of households and numbers 
of individuals. In order to preserve the households as entities in the sample, 
complete households were added or removed. For each tabulation area the 
sizes of the households to be selected for addition or removal were specified 
so as to attain agreement also in the numbers of individuals. In some cases 
where there was an excess of population and deficit of households, or vice versa,
it was necessary both to add and to remove households. Selection of the house- 
holds to be added or removed was accomplished by dividing the full or sample 
records of the area into as many sections as the number of households involved 
and selecting a household of one of the prescribed sizes from each section in 
accordance with a mechanical selection procedure. The adjustment process 
involved the addition of 1,266 new households and the removal of 1,255 existing 
households, both of which are less than 1 per cent. of the original sample total 
of 146,628 households. 

In a few cases this procedure broke down. In the case of Chelsea 
Metropolitan Borough, for example, the sample population of 603 was in 
excess by 94, while the number of households, 187, was in excess by 1 only. 
This was found to be due to a hotel with a population of 100 being included in 
the sample. In order to secure agreement on the population this was removed 
and another household of 6 persons substituted. 

This procedure, which is the only possible one if agreement is to be obtained 
for small areas, must in fact result in the elimination from the sample of a number 
of the exceptional “ households,” such as hotels not sufficiently large to be 
classified as S.E.D.s. The households of this type will therefore be under- 
represented in the sample. Even if these households had been retained the 
information on them would be unreliable when classified by small areas owing 
to sampling variation. A measure of the degree of under-representation can be 
obtained by tabulating the households that were removed and comparing it 
with the corresponding tabulation of the households that were inserted. 

The undertaking has been completely successful in its objective of saving 
time in the tabulation and presentation of the results. A target date of one year 
from the census day was provisionally set by which the analyses should be
completed and the classified results made available in tabulated form. In the
event the publication of the more important analyses was achieved at a date
only a few weeks outside the target period and the remainder followed shortly
thereafter. Previously census results of a similar character have required from
three to four years for their publication.
TABLE A1. RANDOM NUMBERS


03 47 
97 74 
16 76 
12 56 
55 59 


16 22 
84 42 
63 01 
33 21 
57 60 


18 18 
26 62 
23 42 
52 36 
37 85 


70 29 
56 62 
99 49 
16 08 
31 16 


68 34 
74 57 
27 42 
00 39 
29 94 


16 90 
11 27
35 24 
38 23 
31 96 


66 67 
14 90 
68 05 
20 46 
64 19 


05 26 
07 97 
68 71 
26 99 
14 65 


This table forms part of a large! 
Agricultural and Medical Research by R. A. Fis 
1948), and is reproduced by kind permission o! 


43 73 86 
24 67 62 
62 27 66 
85 99 26 
56 35 64 


77 94 39
17 53 31 
63 78 59 
12 34 29 
86 32 44 


07 92 46 
38 97 75 
40 64 74 
28 19 95 
94 35 12 


17 12 13 
18 37 35 
57 22 17 
15 04 72 
93 32 43 


30 13 70 
25 65 76 
37 86 53
68 29 61 
98 94 24 


82 66 59 
94 75 06 
10 16 20 
16 86 38 
25 91 47 


40 67 14 
84 45 11 
51 18 00 
78 73 90 
58 97 79 


93 70 60 
10 88 23 
86 85 85 
61 65 53 
52 68 75 




44 17 16 58 09 
84 16 07 44 99 
82 97 77 77 81 
50 92 26 11 97 
83 39 50 08 30 


40 33 20 38 26 
96 83 50 87 75 
88 42 95 45 72 
33 27 14 34 09 
50 27 89 87 19 


55 74 30 77 40 
59 29 97 68 60 
48 55 90 65 72 
66 37 32 20 30 
68 49 69 10 82 


83 62 64 11 12 
06 09 19 74 66 
33 32 51 26 38 
42 38 97 01 50 
96 44 33 49 13 


22 35 85 15 13 
09 98 42 99 64 
54 87 66 47 54 
58 37 78 80 70 
87 59 36 22 41




46 98 63 71 62 
42 53 32 37 32 
32 90 79 78 53 
05 03 72 93 15 
31 62 43 09 90 


17 37 93 23 78 
77 04 74 47 67
98 10 50 71 75 
52 42 07 44 38 
49 17 46 09 62 


79 83 86 19 62 
83 11 46 32 24 
07 45 32 14 08 
00 56 76 31 38 
42 34 07 96 88 


13 89 51 03 74 
97 12 25 93 47 
16 64 36 16 00 
45 59 34 68 49 
20 15 37 00 49 


44 22 78 84 26 
71 91 38 67 54 
96 57 69 36 10 
77 84 57 03 29 
53 75 91 93 30 


67 19 00 71 74 
02 94 37 34 02 
79 78 45 04 91 
87 75 66 81 41 
34 86 82 53 91 


11 05 65 09 68 
52 27 41 14 86 
07 60 62 93 55 
04 02 33 31 08 
01 90 10 75 06 


92 03 51 59 77 
61 71 62 99 15 
73 32 08 11 12 
42 10 50 67 42 
26 78 63 06 55 


33 26 16 80 45 
27 07 36 07 51 
13 55 38 58 59 
57 12 10 14 21 
06 18 44 32 53 


87 35 20 96 43 
21 76 33 50 25 
12 86 73 58 07 
15 51 00 13 42 
90 52 84 77 27 


06 76 50 03 10 
20 14 85 88 45 
32 98 94 07 72 
80 22 02 53 53 
54 42 06 87 98 


17 76 37 13 04 
70 33 24 03 54 
04 43 18 66 79 
12 72 07 34 45
52 85 66 60 44 


04 33 46 09 52 
13 58 18 24 76 
96 46 92 42 45 
10 45 65 04 26 
34 25 20 57 27 


60 47 21 29 68 
76 70 90 30 86 
16 92 53 56 16 
40 01 74 91 62
00 52 43 48 85 


76 83 20 37 90 
22 98 12 22 08 
59 33 82 43 90 
39 54 16 49 36 
40 78 78 89 62 


59 56 78 06 83 
06 51 29 16 93 
44 95 92 63 16
32 17 55 85 74 
13 08 27 01 50 
TABLE A2. THE NORMAL DISTRIBUTION


Probability of obtaining deviations (positive or negative) greater
than given multiples of the standard deviation


Deviation   Probability   Deviation   Probability   Deviation   Probability
   x/σ           P           x/σ           P           x/σ           P

   0.0        1.0000         1.0         .3173         2.0         .0455
   0.1         .9203         1.1         .2713         2.1         .0357
   0.2         .8415         1.2         .2301         2.2         .0278
   0.3         .7642         1.3         .1936         2.3         .0214
   0.4         .6892         1.4         .1615         2.4         .0164
   0.5         .6171         1.5         .1336         2.5         .0124
   0.6         .5485         1.6         .1096         2.6         .0093
   0.7         .4839         1.7         .0891         2.7         .0069
   0.8         .4237         1.8         .0719         2.8         .0051
   0.9         .3681         1.9         .0574         2.9         .0037
   1.0         .3173         2.0         .0455         3.0         .0027
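
Entries of this table can be reproduced, or values for intermediate deviations obtained, from the complementary error function. The sketch below is illustrative; the function name is hypothetical, and it relies only on `math.erfc` from the Python standard library.

```python
import math


def two_sided_tail(multiple):
    """Probability that a normal deviate exceeds, in either direction,
    the given multiple of the standard deviation."""
    return math.erfc(multiple / math.sqrt(2))


for m in (1.0, 2.0, 3.0):
    print(m, round(two_sided_tail(m), 4))
# 1.0 0.3173
# 2.0 0.0455
# 3.0 0.0027
```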
BIBLIOGRAPHY ON SAMPLING


The bibliography to the First Edition was drawn up by Mr. D. R. Read, 
and is reprinted here without change. It was based on a bibliography prepared 
by the Food and Agricultural Organisation of the United Nations. Additional 
references arranged as in the original bibliography will be found on pages 
386-394. The sections of the new bibliography are distinguished by dashes.


The papers have been classified under the following heads :— 
(A) Theory and methods. 
(B) Machine methods. 
(C) Population censuses. 
(D) Sociology, nutrition, health, etc. 
(E) Opinion surveys and market research. 
(F) Economics: surveys of industry, censuses of production, labour 
force, etc. 
(G) Agricultural economics and farm practice. 
(H) Crop estimation and forecasting, etc. 
(I) Forestry and land utilization surveys. 
(J) Estimation of wild populations. 


Since a single paper does not necessarily deal with only one subject, the 
subject classification must be taken as approximate only. A certain amount
of general theory, for example, will be found in papers primarily dealing with 
special applications. In some instances where the original paper could not 
be consulted the classification has been made from the title and journal. Papers 
by the same author may be found under more than one heading, but papers 
by more than one author are indexed in the section concerned under the name 
of each author, so as to avoid difficulty in tracing all papers by a given author. 
Fertilizer Practice.
Error, limits of, 191, 236.


Error, sampling; see random sampling 
error and bias. 
Errors, in observation and measurement,
15, 106; in fieldwork, 106; in
computations, 124; rounding off, 238;
gross, 124, 354; grouping, 238; see
also investigators, tests of.

Estimates, alternative, 145. 

Estimation of population values, 145-182 ; 
rules for, 147; of sampling errors,
183-245 ; of size of sample and relative 
efficiency, 94-99, 246-296; see also 
under types of sample. 

Experiments, 105, 131, 212. 

Explanatory notes (census forms), 103. 

Exploratory surveys, 48, 99.
mouth, 96, 186, 239.
Fitting constants, 137, 201.
Frame, 20, 60-87, 144; defects of, 60,
44; human populations, 62-78 ; 
economic institutions, 79; agriculture, 
81-87; forestry, 83-87 ; construction 
of second-stage, 34, 68, 71; from
censuses, 65; from lists, 63, 66, 70, 
79; from maps, 68-75, 79, 81-86 ; 
Galvani, L., 40.

Gamma function, 293. 
Gang-punching, 116. 
Geoffrey, L., 127. 
Geographical scope, 49, 141. 
Gini, C., 40. 

Glass, D. V., 130. 

Gray, P. S., 349. 

Greece, population census, 71. 
Gregory, W., 299. 

Gross errors, 124, 354. 
Grouping, 118, 186, 188, 238. 


Half-open interval, 67, 68. 

Hansen, M. H., 36, 65, 352, 358. 

Haphazard selection, 10. 

Healy, M. J. R., 355. 

Hertfordshire farms, samples for wheat 
152, 159, 161, 162, 189, 205, 214, 216,
220, 252, 258; stratified sample, 150, 
203, 249, 251; variable sampling 
fraction, 154, 207, 252, 255, 256, 340, 
351; samples of parishes, 169, 226, 264,
266; two-stage samples, 270, 271, 290. 
Hurwitz, W. N., 36, 352, 358.
Inadequacy, in frame, 60, 61, 78.

Incomplete census, 2.

Incomplete results, 10; adjustment for, 
129, 337. 

Incompleteness, in frame, 60. 

Independence, 196. 

Independent samples, 45. 

Index numbers, calculation of, 117. 

India, Calcutta Institute of Statistics, 83. 

Industrial undertakings, 48, 59, 79. 

Information, description of, 141 ; required, 
51; methods of collection, 57; 
practicability, 55. 

Insect populations, 44. 

Instructions, 103, 141. 

eera values of supplementary variate, 
217. 

Interactions, 105, 140, 211, 311. 

Interpenetrating samples, 44, 105, 107, 
143, 241, 242; examples, 83. 

Inter-relations between units, 54. 

Intra-class correlation, 267. 


Investigators, 58, 105; tests of, 44, 99, 
105, 107, 143, 241; instructions to, 
103 ; conditions of work, 106. 

Iowa State College Statistical Laboratory,

Italy, population census, 40. 


Jamaica, agricultural survey, 356. 
Jessen, R. J., 71, 73. 


Kempthorne, O., 71, 117. 
Keyfitz, N., 308, 333, 343, 358. 
King, A. J., 73. 


King, G. W., 355. 
Kirby, J., 327. 
Kiser, C., 11.
Kish, L., 358. 


Kraals, 159, 214. 


Land utilization surveys, 86. 

Latin square, 356. 

Lattice sampling, 356, 363. 

Least squares, 137, 201. 

Limits of error, 191, 236. 

Line sampling, 42, 85, 86; errors, 229; 
examples, 232.

Linear functions, standard errors of, 196. 

Listing, 114. 

Lists, see frame and systematic sample. 

Livestock, 81. 

Localized population survey, 75. 

Logits, 314. 

Lombard, H. L., 315. 

London School of Economics, 308, 343. 

Losses due to errors, 292, 339. 


M'Gonigle, G. C. M., 327. 

Madow, W., 358. 

Mahalanobis, P. C., 83, 92. 

Maps, use as frames, 68-75, 79, 81-86 ; 
areas from, see point and line sampling. 

Marginal categories, 49. 

Mark sensing, 109, 333. 

Market research, 79. 

Master cards, 116. 

Master sample, 65, 73, 75. 

Mathison, I., 81. 

Matrix inversion, 321.

Mean, arithmetic, 145 ; rule for estimation 
of, 148 ; geometric, 145 ; working, 185 ; 
correction for, 185. 

Mean square, 206. 

Mean square deviation, 183. 




Measurement, errors in, 15. 

Mechanical editing, 333. 

Median, 145.

Milk, composition of, 178, 181, 234, 235. 
Ministry of Agriculture, 82, 128, 158,
322; crop reporters, 324. 

Ministry of Home Security, 67. 

Morbidity, 11. 

Moving observer, 43. 

Multi-phase sample, 38; estimates, 157, 
159, 162; errors, 213, 219, 258; size 
and efficiency, 258, 286, 335. 

Multi-stage sample, 18, 34; estimates, 
170, 171; errors, ; size and
efficiency, 98, 268, 285; in lattice 
sampling, 362; examples, 71, 77, 81, 
84 


Multi-stage sample with uniform overall 
sampling fraction, 36, 148, 171, 278, 
287, 349, 350; examples, 71, 77; 
with adjustment of proportions of 
second-stage units, 78. 

Multiple classification, analysis of, 
131-141, 

Multiple punching, 119. 

Multiple stratification, 25, 254 ; examples, 
73, 77; without control of sub-strata, 


Multiplying punch, 117. 


National Agricultural Advisory Service, 
59, 324. 

National Farm Survey, 115, 117, 128, 158, 
301, 305, 306, 307, 353. 

National Register, 64, 

Natural units, 20 ; hierarchy of, 121, 

Non-response, 59, 107, 130; sub-sample 
for, 108. 

Norfolk—Portsmouth, Virginia, 186. 

Normal distribution, 190 ; sample from, 
149, 185, 188, 190, 192, 298. 

Normal equations, 320. 

Normal law of error, 190. 

Notation, 7, 146, 

Nuffield Trust, 78. 


Observation, errors of, 15, 

Observers, see investigators. 

Odds, 314. 

Opinion surveys, 79; effect of stratifica- 
tion, 248. 

Optimal allocation, 18, 28, 285, 338. 

Optimal values, 284. 

Ordnance Survey, 75, 82, 83, 84. 

Orthogonal polynomials, 321. 


Orthogonality, 211. 
Out-of-date frame, 60. 
Overall estimates, 175. 


Partial replacement, see successive 
occasions. 

Patterson, H. D., 179, 180, 315, 358. 

Percentage standard deviation, 96, 184. 

Percentage standard error, 95, 184. 

Percentages, choice of, 108; calculation 
of, 117; estimation of, 148; standard 
error of, 94, 193. 

Personnel, 142. 

Phase, see multi-phase. 

Pilot surveys, 48, 99, 273. 

Planning of surveys, 48-101, 246, 294. 

Point sampling, 35, 69, 82, 86; estimates, 
167; errors, 224; size and efficiency, 
262, 286; examples, 272. 

Pooled estimate of error, 205, 236. 

Pooling of classes, 137. 

Population, human, 121, 217; frames for, 
62-78 ; localized surveys of, 75-78; 
special classes, 78. 

Population, statistical, 20; finite, 187; 
to be covered, 49, 141; values, see 
estimation. 

Population census, Greece, 71 ; Italy, 40; 
Southern Rhodesia, 159, 214; U.K., 
53; U.S.A., 65; frame from, 65. 

Postal enquiry, 58, 107, 130, 

Potato survey, 131-141, 199, 209, 32 

Powers-Samas, see punched cards.

Precision, relative, 246-283, 305, 353;
definition, 247.

Precoding, 120.

Preliminary computations, see computations.

Preliminary count (of dwellings), 68.

Preliminary estimates, 85, 92.

Printing, 113.

Probability of selection proportional to
size, sample with, 35, 36; estimates,
167, 169; errors, 224, 225; size and
efficiency, 262; selection of, 35, 347;
examples, 71, 77; in multi-stage
sampling, see multi-stage sample with
uniform overall sampling fraction;
see also point sampling and variable
probability.

Probits, 314.

Progressive digiting, 115.

Progressive totals, 115.

Proportions, 303; see also percentages.

Public opinion polls, 79, 299.

Punched cards, 109, 112-123, 126, 333.

Punching, 109, 112, 120, 126.
Purpose of survey, 49, 51, 141.
Purposive selection, 40, 80, 142. 


Qualitative variates, 94, 193, 232, 248, 
314; rule for, 148.

Quality control, 48. 

Quenouille, M. H., 319, 355. 

Questionnaires, 103-105; tests
of, 99, 104, 105; postal, 58, 107.

Questions, wording of, 103.

Quota method, 80, 142, 299.


Rabbit damage, 344. 

Raising factor, 147; overall, 170. 

Random numbers, 21, 297. 

Random sample, 10, 21; estimates, 145,
148, 152, 159, 162, 297; errors, 183-196,
212, 217, 218, 297; size and
efficiency, 94, 248, 249, 256, 353;
examples, 31, 83.

Random sampling error, 2, 9, 17;
estimation of, 183-245; by sampling,
238; from duplicate samples, 242;
presentation of, 243; see also under
types of sample.

Random selection, 21; examples, 22. 

Random selection from areas, 22. 

Randomized blocks, 105. 

Rating offices, 66. 

Ration books, 64. 

Ratios, rule for estimation of, 148;
standard error of, 198, 212; calculation
of, 117; use in investigational
work, 317; use in detection of gross
errors, 354; see also supplementary
information.

Read, D. R., 373.

Rees, D. H., 386. 

Regression, 155, 199, 219, 313; multiple, 
320, 327; curvilinear, 321; grouped 
data, 319; effect of random errors, 

use in investigational work, 317- 

use in detection of gross errors, 

see also supplementary information
and calibration of eye estimates.

Rejection of observations, 354. 

Rents, 158, 

Repeated surveys, 17, 79; see also
successive occasions.

Reports, 141.

Representative sample, 9, 84.

Reproducing punch, 116.

Response, failure of, see non-response.

Road network, sampling of, 363.

Road transport, 334, 362.

Robinson, H. L., 333, 358.

Rolling total tabulator, 114.


Rotational sampling schemes, 356. 
Rounding off, 118, 238.

Rowntree, B. Seebohm, 4. 

Royal Commission on Population, 64, 130. 


Sample, types of, 20-47; see also under 
separate types, 
Sample census, 2. 
Sample survey, 4. 
Sampling error, see random sampling
error and bias.

Sampling fraction, 18, 23, 24, 147, 148. 

Sampling process, 1; in censuses and 
surveys, 2-6; census, incomplete 
sample, 2; survey, sample, 4. 

Sampling units, 20 ; choice of, 19; multi- 
stage, 34; inter-relations between, 54 ; 
variation in size of, 19, 98, 279; rule 
for estimation of number in population, 
147. 

Scoring, 147, 148, 195. 

Selection, methods of, 10, 21, 29, 142, 334. 
Selection with probability proportional 
to size, see probability of selection. 

Shaul, J. R. H., 160. 

Sheppard's correction, 239. 

Shops, 79. 

Sickness, 12. 

Size, probability proportional to, see
probability of selection.

Size of sample, determination of, 94-101, 
246-296; see also under types of 
sample. 

Size of strata, variation in, 98, 280. 

Size of unit, effect on sampling error, 19, 
98; variation in, 99, 279. 

Snedecor, G., 105, 138. 

Social Survey, 78. 

Soil analysis, 82. 

Soil temperatures, 282. 

Solids-not-fat, see milk. 

Sorter, 113. 

Sorter-counter, 113. 

Sorting, 110. 

Southern Rhodesia, 159, 214. 

Spence, J. C., 327. 

Stage, see multi-stage. 

Standard deviation, 96, 183, 190; per- 
centage, 96, 184. 

Standard error, 32, 94, 183; percentage, 
95, 184; of qualitative variates, 94, 
193; of mean, 96, 184, 187; of total, 
96, 184, 187; of ratio, 198, 212; of 
multiple, 196; of product, 198; of 
sum, 197; of difference, 196; of linear 
function, 196; of weighted mean, 197; 
of standard deviation, 192; effect of 
lack of independence, 198; see also 
under types of sample. 


see random sampling 


to, see 
Standardization, 157, 158, 159, 162, 213,
219, 318.

Statistical analysis, see analysis. 

Statistician, functions of, 6, 49. 

Stephan, F. F., 65. 

Stevens, W. L., 138. 

Stock, J. S., 77.

Stones, sampling of, 12. 

Stratification after selection, 25, 32, 152, 
205. 

Stratification, multiple, see multiple
stratification.

Stratified sample, variation in size of
strata, 98, 280.

Stratified sample with one unit per
stratum, 24, 78, 280.

Stratified sample with uniform sampling
fraction, 17, 23, 146; estimates, 150,
160, 164;
300, 303 ;


Stratified sample with variable sampling 
fraction, see variable sampling fraction. 

Streets, sampling by, 67, 69. 

Sub-Commission on Statistical Sampling, 
141. 

Sub-sample for non-response, 60; on 
successive occasions, 45; see also 
multi-phase sample. 

Sub-totals, 115. 

Substitution, 10, 108. 

Successive occasions, sampling on, 17, 45; 
estimates, 175, 179; errors, 233; size
and efficiency, 260; example, 77. 

Sukhatme, P. V., 15, 255. 

Sum of squares, 185, 206; calculation of,

Summary punch, 116. 

Supervision, 19, 105. 

Supplementary information, 18, 32, 38,
98, 145, 146; ratio method, 71, 155-162,
171-174, 198, 212-218, 256; double
ratio method, 343; regression method,
155, 162-165, 171, 218-222, 256;
effects of errors in, 213.

Survey, definition, 4.

Survey of Fertilizer Practice, 57, 81, 111,
123, 171, 227, 240, 257, 264, 291, 295,
346.

Syracuse, U.S.A., 11.

Systematic sample from areas, 10, 41;
estimates, 174; errors, 229; size and
efficiency, 282; example, 83.

Systematic sample from a list or card
index, 10, 29; estimates, 174; errors,
229, 366; selection of, 334; examples,
64, 65, 67, 81.


t distribution, 192. 

Tabulation, machine, 114. 

Tabulator, 113. 

Telephone enquiries, 80. 

Temperatures, soil, 282.

Tepping, E. J., 127, 358. 

Terminology, 7. 

Tests of questionnaires, 99, 104, 105; of 
investigators, 99, 105, 241; of 
significance, 188, 200. 

Thomas, G., 55. 

Timber, see Census of Woodlands. 

Totals, computation of, 109, 110, 111, 
113; rule for estimation of, 147. 

Tracks, 70. 

Trailers, 116, 120. 

Training, 99, 105.

Transformations, 236, 314. 

Travelling, 19. 

Two-nine feature, 120. 

Two-way tables, analysis of, 131. 


U.S.A. employment estimates, 76 ; master 
sample, 73; population census, 65, 
334 ; presidential elections, 80, 334. 

Undeveloped areas, 34, 42, 296; frames 
for, 70, 85-87. 

Unemployment, 76. 

Uniform sampling fraction, 23; overall, 
see multi-stage sample with uniform 
overall sampling fraction.

Unit, natural, see natural units.

Unit, sampling, see sampling units. 

United Kingdom, effects of air raids, 67; 
Family Census, 53, 64, 130; localized 
surveys, 76; Population Census, 53,
365. 

United Nations, 141; Food and
Agriculture Organization, 373; 
Economic Commission for Europe, 363. 


Variable probability, sampling with, 352. 

Variable sampling fraction, sample with,
18, 28; estimates, 153, 161, 164;
errors, 201, 205, 216, 221, 300, 303;
size and efficiency, 98, 254, 256, 305;
optimal allocation, 18, 28, 285;
selection of, 334; examples,
81, 115, 128, 171.

Variance, 183; unequal, 2 2

Variate, 146.

Variation, coefficient of, 184.

Vehicles, sample of, 334.

Villages, 70. 
Weighted mean, 17; of sub-class means,
134; of differences of sub-class means, 
136 ; standard error of, 197. 

Weighting factors, 108, 123. 

Wheat, 13, 15, 166, 223, 273, 344, 347; 
see also Hertfordshire farms.

Wireworms, 236. 

World statistics, 51. 
Yates, F., 12, 13, 81, 138, 211, 237, 268,
283. 

Yield per acre, see crop estimation and 
crop forecasting; bias in, 16. 


z transformation, 314. 
Zacopanay, I., 268. 